Abstract

A framework is introduced for solving a sequence of slowly changing optimization problems, including those 
arising in regression and classification applications, using optimization algorithms such as stochastic gradient descent 
(SGD). The optimization problems change slowly in the sense that the minimizers change at either a fixed or bounded 
rate. A method based on estimates of the change in the minimizers and properties of the optimization algorithm is 
introduced for adaptively selecting the number of samples needed from the distributions underlying each problem in 
order to ensure that the excess risk, i.e., the expected gap between the loss achieved by the approximate minimizer 
produced by the optimization algorithm and the exact minimizer, does not exceed a target level. Experiments with 
synthetic and real data are used to confirm that this approach performs well. 


1 Introduction 

Consider solving a sequence of machine learning problems such as regression or classification by minimizing the 
expected value of a fixed loss function $\ell(x, z)$ at each time $n$:

$$\min_{x \in \mathcal{X}} \left\{ f_n(x) = \mathbb{E}_{z_n \sim p_n}\left[\ell(x, z_n)\right] \right\} \quad \forall n \geq 1 \tag{1}$$

For regression, $z_n$ corresponds to the predictors and response pair at time $n$ and $x$ parameterizes the regression model.
For classification, $z_n$ corresponds to the feature and label pair at time $n$ and $x$ parameterizes the classifier. Although
motivated by regression and classification, our framework works for any loss function $\ell(x,z)$ that satisfies certain
properties discussed later. In the learning context, a task consists of the loss function $\ell(x,z)$ and the distribution
$p_n$, and so our problem can be viewed as learning a sequence of tasks.

The problems change slowly at a constant but unknown rate in the sense that

$$\|x_n^* - x_{n-1}^*\| = \rho \quad \forall n \geq 2 \tag{2}$$

with $x_n^*$ the minimizer of $f_n(x)$. In an extended version of this paper [?], we also consider slow changes at a bounded
but unknown rate

$$\|x_n^* - x_{n-1}^*\| \leq \rho \quad \forall n \geq 2 \tag{3}$$

Under this model, we find approximate minimizers $x_n$ of each function $f_n(x)$ using $K_n$ samples from distribution
$p_n$ by applying an optimization algorithm. We evaluate the quality of our approximate minimizers through an
excess risk criterion $\epsilon_n$, i.e.,

$$\mathbb{E}[f_n(x_n)] - f_n(x_n^*) \leq \epsilon_n$$

which is a standard criterion for optimization and learning problems [?]. Our goal is to determine adaptively the
number of samples required to achieve a desired excess risk $\epsilon$ for each $n$ with $\rho$ unknown. As $\rho$ is unknown, we
will construct estimates of $\rho$. Given an estimate of $\rho$, we determine selection rules for the number of samples $K_n$ to
achieve a target excess risk $\epsilon$.

*This work was supported by the NSF under award CCF 11-11342 through the University of Illinois at Urbana-Champaign.

1.1 Related Work 

Our problem has connections with multi-task learning (MTL) and transfer learning. In multi-task learning, one tries 
to learn several tasks simultaneously, as in [?], [?], and [?], by exploiting the relationships between the tasks. In transfer
learning, knowledge from one source task is transferred to another target task either with or without additional training
data for the target task [?]. Multi-task learning could be applied to our problem by running an MTL algorithm each time
a new task arrives, while remembering all prior tasks. However, this approach incurs a memory and computational
burden. Transfer learning lacks the sequential nature of our problem. For multi-task and transfer learning, there are
theoretical guarantees on regret for some algorithms [?].

We can also consider the concept drift problem in which we observe a stream of incoming data that potentially 
changes over time, and the goal is to predict some property of each piece of data as it arrives. After prediction, we 
incur a loss that is revealed to us. For example, we could observe a feature $w_n$ and predict the label $y_n$ as in [?].
Some approaches for concept drift use iterative algorithms such as SGD, but without specific models on how the data
changes. As a result, only simulation results showing good performance are available. There are also some bandit
approaches in which one of a finite number of predictors must be applied to the data as in [8]. For this approach, there
are regret guarantees using techniques for analyzing bandit problems. 

Another relevant model is sequential supervised learning (see [?]) in which we observe a stream of data consisting
of feature/label pairs $(w_n, y_n)$ at time $n$, with $w_n$ being the feature vector and $y_n$ being the label. At time $n$, we want
to predict $y_n$ given $w_n$. One approach to this problem, studied in [10] and [11], is to look at $L$ consecutive pairs
and develop a predictor at time $n$ by applying a supervised learning algorithm to this training data.
Another approach is to assume that there is an underlying hidden Markov model (HMM) [?]. The label $y_n$ represents
the hidden state and the pair $(w_n, \hat y_n)$ represents the observation, with $\hat y_n$ being a noisy version of $y_n$. HMM inference
techniques are used to estimate $y_n$.


2 Adaptive Sequential Optimization With $\rho$ Known

For analysis, we need the following assumptions on our functions $f_n(x)$ and the optimization algorithm:

A.1 For the optimization algorithm under consideration, there is a function $b(d_0, K_n)$ such that

$$\mathbb{E}[f_n(x_n)] - f_n(x_n^*) \leq b(d_0, K_n)$$

with $K_n$ the number of samples from $p_n$ and $\mathbb{E}\|x_n(0) - x_n^*\|^2 \leq d_0$, where $x_n(0)$ is the initial point of the
optimization algorithm at time $n$. Finally, $b(d_0, K_n)$ is non-decreasing in $d_0$.

A.2 Each loss function $\ell(x,z)$ is differentiable in $x$. Each $f_n(x)$ is strongly convex with parameter $m$, i.e.,

$$f_n(y) \geq f_n(x) + \langle \nabla_x f_n(x), y - x \rangle + \frac{1}{2} m \|y - x\|^2$$

A.3 $\mathrm{diam}(\mathcal{X}) < +\infty$

A.4 We can find initial points $x_1$ and $x_2$ that satisfy the excess risk criterion with $\epsilon_1$ and $\epsilon_2$ known, i.e.,

$$\mathbb{E}[f_i(x_i)] - f_i(x_i^*) \leq \epsilon_i \quad i = 1, 2$$
Remarks: For assumption A.1, we assume that the bound $b(d_0, K_n)$ depends on the number of samples $K_n$ and not
the number of iterations. For SGD, generally the number of iterations equals $K_n$, as each sample is used to produce a
noisy gradient. In addition, we often set $x_n(0) = x_{n-1}$. See Appendix A for a discussion of useful $b(d_0, K_n)$ bounds.

For assumption A.4, we can fix $K_i$ and set $\epsilon_i = b(\mathrm{diam}(\mathcal{X})^2, K_i)$ for $i = 1, 2$.

Now, we examine the case when the change in minimizers, $\rho$ in (2) or (3), is known. For the analysis of this section,
it does not matter whether (2) or (3) holds. Later we will estimate $\rho$, and in that case whether (2) or (3) holds
matters substantially.

We want to find a bound $\epsilon_n$ on the excess risk at time $n$ in terms of $K_n$ and $\rho$, i.e., $\epsilon_n$ such that $\mathbb{E}[f_n(x_n)] - f_n(x_n^*) \leq \epsilon_n$.
The idea is to start with the bounds from assumption A.4 and proceed inductively using the previous $\epsilon_{n-1}$ and $\rho$
from (2). Suppose that $\epsilon_{n-1}$ bounds the excess risk at time $n - 1$. Using the triangle inequality, strong convexity, and
(2) we have

$$\mathbb{E}\|x_{n-1} - x_n^*\|^2 \leq \mathbb{E}\left(\|x_{n-1} - x_{n-1}^*\| + \|x_{n-1}^* - x_n^*\|\right)^2 \leq \left(\sqrt{\frac{2\epsilon_{n-1}}{m}} + \rho\right)^2 \tag{4}$$

In comparison, we could use the estimate $\mathrm{diam}^2(\mathcal{X})$ to bound $\mathbb{E}\|x_{n-1} - x_n^*\|^2$ and select $K_n$. If the bound in (4) is
much smaller than $\mathrm{diam}(\mathcal{X})^2$, then we need significantly fewer samples $K_n$ to guarantee a desired excess risk. Now,
by using the bound $b(d_0, K_n)$ from assumption A.1, we can set

$$\epsilon_n = b\left(\left(\sqrt{\frac{2\epsilon_{n-1}}{m}} + \rho\right)^2, K_n\right) \tag{5}$$

which yields a sequence of bounds on the excess risk. Note that this recursion only relies on the immediate past at
time $n - 1$ through $\epsilon_{n-1}$. To achieve $\epsilon_n \leq \epsilon$ for all $n$, we set

$$K_1 = \min\left\{K \geq 1 \;\middle|\; b\left(\mathrm{diam}(\mathcal{X})^2, K\right) \leq \epsilon\right\}$$

and $K_n = K^*$ for $n \geq 2$ with

$$K^* = \min\left\{K \geq 1 \;\middle|\; b\left(\left(\sqrt{\frac{2\epsilon}{m}} + \rho\right)^2, K\right) \leq \epsilon\right\}$$
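The selection rule for the sample sizes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bound `b(d0, K) = (c0*d0 + c1)/K` and its constants `c0`, `c1` are hypothetical stand-ins for whatever bound the chosen optimization algorithm provides under assumption A.1, and the helper names are not from the paper.

```python
import math

def b(d0, K, c0=1.0, c1=4.0):
    """Hypothetical excess-risk bound (c0*d0 + c1)/K.

    Assumed form for illustration only: non-decreasing in d0, decreasing in K.
    """
    return (c0 * d0 + c1) / K

def initial_distance_bound(eps_prev, rho, m):
    """Bound (sqrt(2*eps_prev/m) + rho)^2 on the expected squared distance
    from the previous approximate minimizer to the new minimizer."""
    return (math.sqrt(2.0 * eps_prev / m) + rho) ** 2

def next_excess_risk(eps_prev, rho, m, K):
    """One step of the recursion eps_n = b((sqrt(2*eps_{n-1}/m) + rho)^2, K_n)."""
    return b(initial_distance_bound(eps_prev, rho, m), K)

def smallest_K(d0, eps, K_max=10**6):
    """Smallest K >= 1 with b(d0, K) <= eps, found by linear search."""
    for K in range(1, K_max + 1):
        if b(d0, K) <= eps:
            return K
    raise ValueError("no K up to K_max meets the target excess risk")

def select_sample_sizes(eps, rho, m, diam):
    """Return (K_1, K_star): K_1 uses diam^2 as the initial distance bound,
    K_star uses the tighter bound based on rho."""
    K1 = smallest_K(diam ** 2, eps)
    K_star = smallest_K(initial_distance_bound(eps, rho, m), eps)
    return K1, K_star
```

With a small `rho` relative to `diam`, `K_star` comes out much smaller than `K_1`, which is the saving the slow-change model is meant to capture.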




3 Estimating $\rho$

In practice, we do not know $\rho$, so we must construct an estimate $\hat\rho_n$ using the samples from each distribution $p_n$. We
introduce two approaches to estimate the change at one time step, $\|x_i^* - x_{i-1}^*\|$, and methods to combine these estimates under
assumptions (2) and (3). We show that for our estimate $\hat\rho_n$ and appropriately chosen sequences $\{t_n\}$, for all $n$ large
enough $\hat\rho_n + t_n \geq \rho$ almost surely. With this property, analysis similar to that in Section 2 holds.


3.1 Allowed Ways to Choose $K_n$

One of the sources of difficulty in estimating $\rho$ is that we will allow $K_n$ to be selected in a data dependent way, so $K_n$
is itself a random variable. We make the assumption that $K_n$ is selected using only information available at the end of
time $n - 1$. To make this precise, we define a filtration of sigma algebras to describe the available information. First,
we define the sigma algebra $\mathcal{K}_0$ containing all the information on the initial conditions of our algorithm. For example,
we may start at a random point $x_0$ and then

$$\mathcal{K}_0 = \sigma(x_0)$$

The sigma algebra $\mathcal{K}_0$ may also contain information about $K_1$ and $K_2$. Next, we define the filtration

$$\mathcal{K}_n = \sigma\left(\{z_n(k)\}_{k=1}^{K_n}\right) \vee \mathcal{K}_{n-1} \quad \forall n \geq 1 \tag{6}$$

where $\vee$ is the merge operator for sigma algebras. The sigma algebra $\mathcal{K}_n$ contains all the information available to us at the
end of time $n$. We assume that $K_n$ is $\mathcal{K}_{n-1}$-measurable to capture the idea that $K_n$ is chosen only using information
available at the end of time $n - 1$.


3.2 Estimating One Step Change 

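First, we estimate the one step changes $\|x_i^* - x_{i-1}^*\|$, with the estimates denoted by $\tilde\rho_i$. Implicitly, we assume that all one step estimates
are capped by $\mathrm{diam}(\mathcal{X})$, since trivially $\|x_i^* - x_{i-1}^*\| \leq \mathrm{diam}(\mathcal{X})$.


3.2.1 Direct Estimate

First, we construct an estimate $\tilde\rho_i$ of the one step changes $\|x_i^* - x_{i-1}^*\|$. Using the triangle inequality and variational
inequalities from [13] yields

$$\|x_i^* - x_{i-1}^*\| \leq \|x_i - x_{i-1}\| + \|x_i - x_i^*\| + \|x_{i-1} - x_{i-1}^*\| \leq \|x_i - x_{i-1}\| + \frac{1}{m}\|\nabla_x f_i(x_i)\| + \frac{1}{m}\|\nabla_x f_{i-1}(x_{i-1})\|$$

We then approximate $\|\nabla_x f_i(x_i)\| = \left\|\mathbb{E}_{z_i \sim p_i}[\nabla_x \ell(x_i, z_i)]\right\|$ by

$$\left\|\frac{1}{K_i}\sum_{k=1}^{K_i} \nabla_x \ell(x_i, z_i(k))\right\|$$

to yield the following estimate that we call the direct estimate:

$$\tilde\rho_i = \|x_i - x_{i-1}\| + \frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(x_i, z_i(k))\right\| + \frac{1}{m}\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_x\ell(x_{i-1}, z_{i-1}(k))\right\|$$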
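As an illustration of the direct estimate, the following minimal sketch computes it for one specific loss. The squared loss $\frac{1}{2}(y - w^\top x)^2$, the sample layout as `(w, y)` pairs, and all function names are assumptions made for this example; the paper's estimate applies to any loss satisfying its assumptions.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def mean_gradient(x, samples):
    """Average gradient in x of 0.5*(y - <w, x>)^2 over samples [(w, y), ...].

    The squared loss is an assumed example; swap in the gradient of the
    loss actually being minimized.
    """
    d = len(x)
    total = [0.0] * d
    for w, y in samples:
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j in range(d):
            total[j] += residual * w[j]
    return [t / len(samples) for t in total]

def direct_estimate(x_i, x_prev, samples_i, samples_prev, m, diam=math.inf):
    """Direct one-step estimate: distance between consecutive approximate
    minimizers plus (1/m) times the two empirical gradient norms, capped
    at the diameter of the feasible set."""
    step = norm([a - b for a, b in zip(x_i, x_prev)])
    g_i = norm(mean_gradient(x_i, samples_i))
    g_prev = norm(mean_gradient(x_prev, samples_prev))
    return min(step + (g_i + g_prev) / m, diam)
```

The cap by `diam` mirrors the remark that every one step estimate is implicitly capped by the diameter of the feasible set.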


3.2.2 Vector Integral Probability Metric Estimate 

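Given a class of functions $\mathcal{F}$ where each $f \in \mathcal{F}$ maps $\mathcal{Z} \to \mathbb{R}$, an integral probability metric (IPM) [14] between two
distributions $p$ and $q$ is defined to be

$$\gamma_{\mathcal{F}}(p, q) = \sup_{f \in \mathcal{F}} \left|\mathbb{E}_{z \sim p}[f(z)] - \mathbb{E}_{\tilde z \sim q}[f(\tilde z)]\right|$$

We consider an extension of this idea, which we call a vector IPM, in which the class of functions maps $\mathcal{Z} \to \mathbb{R}^d$:

$$\gamma_{\mathcal{F}}(p, q) = \sup_{f \in \mathcal{F}} \left\|\mathbb{E}_{z \sim p}[f(z)] - \mathbb{E}_{\tilde z \sim q}[f(\tilde z)]\right\| \tag{7}$$

Lemma 1 shows that a vector IPM can be used to bound the change in minimizer at time $i$ and follows from variational
inequalities in [13] and the assumption that $\{\nabla_x \ell(x, \cdot) : x \in \mathcal{X}\} \subset \mathcal{F}$.

Lemma 1. Assume that $\{\nabla_x \ell(x, \cdot) : x \in \mathcal{X}\} \subset \mathcal{F}$. Then $\|x_i^* - x_{i-1}^*\| \leq \frac{1}{m}\gamma_{\mathcal{F}}(p_i, p_{i-1})$.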
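Proof. By exploiting variational inequalities from [13], we can show that

$$\|x_i^* - x_{i-1}^*\| \leq \frac{1}{m}\left\|\nabla_x f_i(x_{i-1}^*) - \nabla_x f_{i-1}(x_{i-1}^*)\right\| = \frac{1}{m}\left\|\mathbb{E}_{z_i \sim p_i}\left[\nabla_x\ell(x_{i-1}^*, z_i)\right] - \mathbb{E}_{z_{i-1}\sim p_{i-1}}\left[\nabla_x \ell(x_{i-1}^*, z_{i-1})\right]\right\|$$

By assumption $\{\nabla_x \ell(x, \cdot) : x \in \mathcal{X}\} \subset \mathcal{F}$, so

$$\begin{aligned}
\left\|\nabla_x f_i(x_{i-1}^*) - \nabla_x f_{i-1}(x_{i-1}^*)\right\| &= \left\|\mathbb{E}_{z_i \sim p_i}\left[\nabla_x\ell(x_{i-1}^*, z_i)\right] - \mathbb{E}_{z_{i-1}\sim p_{i-1}}\left[\nabla_x\ell(x_{i-1}^*, z_{i-1})\right]\right\| \\
&\leq \sup_{f\in\mathcal{F}}\left\|\mathbb{E}_{z_i\sim p_i}[f(z_i)] - \mathbb{E}_{z_{i-1}\sim p_{i-1}}[f(z_{i-1})]\right\| \\
&= \gamma_{\mathcal{F}}(p_i, p_{i-1})
\end{aligned}$$

□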


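We cannot compute this vector IPM, since we do not know the distributions $p_i$ and $p_{i-1}$. Instead, we plug in
the empirical distributions $\hat p_i$ and $\hat p_{i-1}$ to yield the estimate $\frac{1}{m}\gamma_{\mathcal{F}}(\hat p_i, \hat p_{i-1})$. This estimate is biased upward, which ensures that

$$\|x_i^* - x_{i-1}^*\| \leq \mathbb{E}\left[\frac{1}{m}\gamma_{\mathcal{F}}(\hat p_i, \hat p_{i-1})\right].$$

Our estimate is still not in a closed form, since there is a supremum over $\mathcal{F}$ in the computation of $\gamma_{\mathcal{F}}(\hat p_i, \hat p_{i-1})$.
For the class of functions

$$\mathcal{F} = \left\{f \mid \|f(z) - f(\tilde z)\| \leq r(z, \tilde z)\right\} \tag{8}$$

we can compute an upper bound $\Gamma_i$ on $\gamma_{\mathcal{F}}(\hat p_i, \hat p_{i-1})$, yielding a computable estimate $\tilde\rho_i = \frac{1}{m}\Gamma_i$. Set $\tilde z_i(k) = z_i(k)$ if
$1 \leq k \leq K_i$ and $\tilde z_i(k) = z_{i-1}(k - K_i)$ if $K_i + 1 \leq k \leq K_i + K_{i-1}$. From (7), we have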


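$$\gamma_{\mathcal{F}}(\hat p_i, \hat p_{i-1}) = \sup_{f \in \mathcal{F}}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i} f(\tilde z_i(k)) - \frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}} f(\tilde z_i(K_i + k))\right\|$$

We can relax this supremum by maximizing over the function values $f(\tilde z_i(k))$, denoted by $\alpha_k$, in the following non-convex
quadratically constrained quadratic program (QCQP):

$$\begin{aligned}
\text{maximize} \quad & \left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\alpha_k - \frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\alpha_{K_i+k}\right\| \\
\text{subject to} \quad & \|\alpha_k - \alpha_j\| \leq r(\tilde z_i(k), \tilde z_i(j)) \quad \forall k < j
\end{aligned}$$

The constraints are imposed to ensure that the function values can correspond to a function in $\mathcal{F}$ from (8). The
value of this QCQP may not exactly equal the vector IPM but at least provides an upper bound. Finally, we note that
this QCQP can be converted to its dual form to yield an SDP, which is often easier to solve.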


3.2.3 Comparison of Estimates 

The direct estimate is easier to compute but may be loose if $\|x_n - x_n^*\|$ is large. If $\|x_n - x_n^*\|$ is large, then the vector
IPM approach is in general tighter. However, the vector IPM is more difficult to compute due to the need to solve a QCQP
or SDP and check the inclusion conditions in Lemma 1. Also, the number of constraints in the QCQP or SDP grows
quadratically in the number of samples.


3.3 Combining One Step Estimates For Constant Change 

Assuming that $\|x_i^* - x_{i-1}^*\| = \rho$ from (2), we average the one step estimates $\tilde\rho_i$ to yield a better estimate

$$\hat\rho_n = \frac{1}{n-1}\sum_{i=2}^{n} \tilde\rho_i$$

of $\rho$ at each time $n$ under (2). To analyze the behavior of our combined estimates, we use sub-Gaussian concentration
inequalities detailed in Appendix B. Lemma 22 is of particular importance to our analysis.
3.3.1 Direct Estimate

The difficulty in analyzing the direct estimate comes because in approximating $\frac{1}{m}\|\nabla f_i(x_i)\|$ by

$$\frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x \ell(x_i, z_i(k))\right\|$$

$x_i$ is dependent on all the samples $\{z_i(k)\}_{k=1}^{K_i}$. To illustrate the problem further, consider drawing two independent
copies $\{z_i(k)\}_{k=1}^{K_i} \sim p_i$ and $\{\tilde z_i(k)\}_{k=1}^{K_i} \sim p_i$ of the samples. Suppose that we use the second copy to
compute $\tilde x_i$ using our optimization algorithm of choice starting from $x_{i-1}$. Then we approximate $\frac{1}{m}\|\nabla f_i(\tilde x_i)\|$ by

$$\frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(\tilde x_i, z_i(k))\right\|$$

Now, since $\tilde x_i$ is independent of $\{z_i(k)\}_{k=1}^{K_i}$, the quantity

$$\frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(\tilde x_i, z_i(k))\right\|$$

is the norm of an average of independent random variables conditioned on $\tilde x_i$. This allows us to apply standard
concentration inequalities for norms of random variables as in [?]. In this section, we argue that re-using the samples
$\{z_i(k)\}_{k=1}^{K_i}$ to compute $x_i$ is not too far from using a second independent draw $\{\tilde z_i(k)\}_{k=1}^{K_i}$.
For analysis, we need the following additional assumptions:

B.1 The loss function $\ell(x,z)$ has uniform Lipschitz continuous gradients in $x$ with modulus $L$, i.e.,

$$\|\nabla_x\ell(x,z) - \nabla_x\ell(\tilde x,z)\| \leq L\|x - \tilde x\| \quad \forall z \in \mathcal{Z}$$

B.2 Assuming $\mathcal{X}$ is $d$-dimensional, each component $j$ of the gradient error $\nabla_x\ell(x, z_n) - \nabla f_n(x)$ satisfies

$$\mathbb{E}\left[\exp\left\{s\left(\nabla_x\ell(x, z_n) - \nabla f_n(x)\right)_j\right\}\right] \leq \exp\left\{\frac{1}{2}\frac{C_g}{d^2}s^2\right\}$$

Assumption B.1 is reasonable if the space $\mathcal{Z}$ containing $z$ is compact. Although in practice the distribution of the
gradient error could depend on $x$, we assume that the bound $C_g$ does not depend on $x$. We can view this as a
pessimistic assumption corresponding to choosing the worst case bound as a function of $x$ and the resulting $C_g$. This
is a common assumption in high probability analysis of optimization algorithms, as in [?] for example.

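To proceed, we first define two other useful estimates for $\rho$. As discussed before, suppose that we make a second
independent draw of samples $\{\tilde z_i(k)\}_{k=1}^{K_i}$ from $p_i$. We use these samples to compute $\tilde x_i$ in the same manner as $x_i$
starting from $x_{i-1}$, except with $\{\tilde z_i(k)\}_{k=1}^{K_i}$ used in place of $\{z_i(k)\}_{k=1}^{K_i}$. Then define

$$\tilde\rho_i^{(2)} \triangleq \|\tilde x_i - \tilde x_{i-1}\| + \frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(\tilde x_i, z_i(k))\right\| + \frac{1}{m}\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_x\ell(\tilde x_{i-1}, z_{i-1}(k))\right\|$$

This is the same form as the direct estimate with $\tilde x_i$ in place of $x_i$. Next, define

$$\tilde\rho_i^{(3)} \triangleq \|\tilde x_i - \tilde x_{i-1}\| + \frac{1}{m}\|\nabla f_i(\tilde x_i)\| + \frac{1}{m}\|\nabla f_{i-1}(\tilde x_{i-1})\|$$

This is in fact the bound that inspired the direct estimate. We also define the averaged estimates

$$\hat\rho_n^{(2)} \triangleq \frac{1}{n-1}\sum_{i=2}^{n} \tilde\rho_i^{(2)}$$

and

$$\hat\rho_n^{(3)} \triangleq \frac{1}{n-1}\sum_{i=2}^{n}\tilde\rho_i^{(3)}$$

We know that $\hat\rho_n^{(3)} \geq \rho$. Thus, if we can control the gap between the pair $\hat\rho_n$ and $\hat\rho_n^{(2)}$ and the pair $\hat\rho_n^{(2)}$ and $\hat\rho_n^{(3)}$, then
we can ensure that $\hat\rho_n$ plus an appropriate constant upper bounds $\rho$ for all $n$ large enough, as desired.

First, we show that $\hat\rho_n^{(2)}$ plus an appropriate constant upper bounds $\rho$ eventually.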

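Lemma 2. Suppose that the following conditions hold:

1. B.1-B.2 hold

2. The sequence $\{t_n\}$ satisfies

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{(n-1)m^2t_n^2}{72C_g}\right\} < +\infty$$

Then for all $n$ large enough it holds that $\hat\rho_n^{(2)} + C_n^{(2)} + t_n \geq \rho$ almost surely with

$$C_n^{(2)} \triangleq \frac{1}{\sqrt{d}\,m(n-1)}\left(\sqrt{\frac{C_g}{K_1}} + 2\sum_{i=2}^{n-1}\sqrt{\frac{C_g}{K_i}} + \sqrt{\frac{C_g}{K_n}}\right)$$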

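Proof. First, we have by the triangle inequality and reverse triangle inequality

$$\begin{aligned}
\left|\tilde\rho_i^{(2)} - \tilde\rho_i^{(3)}\right| &\leq \frac{1}{m}\left|\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(\tilde x_i, z_i(k))\right\| - \|\nabla_x f_i(\tilde x_i)\|\right| + \frac{1}{m}\left|\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_x\ell(\tilde x_{i-1}, z_{i-1}(k))\right\| - \|\nabla_x f_{i-1}(\tilde x_{i-1})\|\right| \\
&\leq \frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| + \frac{1}{m}\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\left(\nabla_x\ell(\tilde x_{i-1}, z_{i-1}(k)) - \nabla_x f_{i-1}(\tilde x_{i-1})\right)\right\|
\end{aligned}$$

Then by the triangle inequality, we have

$$\begin{aligned}
\left|\hat\rho_n^{(2)} - \hat\rho_n^{(3)}\right| &\leq \frac{1}{m(n-1)}\sum_{i=2}^{n}\left(\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| + \left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\left(\nabla_x\ell(\tilde x_{i-1}, z_{i-1}(k)) - \nabla_x f_{i-1}(\tilde x_{i-1})\right)\right\|\right) \\
&= \frac{1}{m(n-1)}\left(\left\|\frac{1}{K_1}\sum_{k=1}^{K_1}\left(\nabla_x\ell(\tilde x_1, z_1(k)) - \nabla_x f_1(\tilde x_1)\right)\right\| + 2\sum_{i=2}^{n-1}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| + \left\|\frac{1}{K_n}\sum_{k=1}^{K_n}\left(\nabla_x\ell(\tilde x_n, z_n(k)) - \nabla_x f_n(\tilde x_n)\right)\right\|\right)
\end{aligned} \tag{9}$$

We will analyze the behavior of this bound on $\left|\hat\rho_n^{(2)} - \hat\rho_n^{(3)}\right|$ using Lemma 22 in Appendix B. Define the filtration

$$\mathcal{F}_i = \sigma\left(\bigcup_{j=1}^{i}\{z_j(k)\}_{k=1}^{K_j} \cup \bigcup_{j=1}^{i+1}\{\tilde z_j(k)\}_{k=1}^{K_j}\right) \vee \mathcal{K}_0 \quad i = 0, \ldots, n \tag{10}$$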
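with $\mathcal{K}_0$ from (6). Note that $\mathcal{K}_{i-1} \subset \mathcal{F}_{i-1}$, so $K_i$ is $\mathcal{F}_{i-1}$-measurable. In addition, $\tilde x_i$ but not $x_i$ is $\mathcal{F}_{i-1}$-measurable.
Define the random variables

$$V_i = \left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| - \mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| \,\middle|\, \mathcal{F}_{i-1}\right] \quad i = 1, \ldots, n$$

Clearly, $V_i$ is $\mathcal{F}_i$-measurable, since $V_i$ is a function of $K_i$, $\tilde x_i$, and $\{z_i(k)\}_{k=1}^{K_i}$, all of which are $\mathcal{F}_i$-measurable. Conditioned
on $\mathcal{F}_{i-1}$, the sum

$$\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right) \tag{11}$$

is a sum of iid random variables. We now work with the conditional measure $\mathbb{P}\{\cdot \mid \mathcal{F}_{i-1}\}$ to compute sub-Gaussian
norms of (11), defined in (24) and (25) of Appendix B. By assumption B.2, we have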


[{V:^t{Xi,Zi{k)) - V^fi{Xi))^ < ^ 

Therefore, applying Lemma l24l yields 

B\Y i'^3:K^i,Zi{k)) - \^^fi{Xi)) 

\k=i 

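due to the independence conditioned on $\mathcal{F}_{i-1}$. By applying Lemma 25 from [?] to the conditional distribution
$\mathbb{P}\{\cdot \mid \mathcal{F}_{i-1}\}$, we have

$$\mathbb{P}\left\{\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| > t \,\middle|\, \mathcal{F}_{i-1}\right\} \leq 2\exp\left\{-\frac{t^2}{2\left(\sqrt{C_g/K_i}\right)^2}\right\} = 2\exp\left\{-\frac{K_it^2}{2C_g}\right\}$$

Since

$$\mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| \,\middle|\, \mathcal{F}_{i-1}\right] \geq 0,$$

we have

$$\begin{aligned}
&\mathbb{P}\left\{\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| - \mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| \,\middle|\, \mathcal{F}_{i-1}\right] > t \,\middle|\, \mathcal{F}_{i-1}\right\} \\
&\quad\leq \mathbb{P}\left\{\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| > t \,\middle|\, \mathcal{F}_{i-1}\right\} \\
&\quad\leq 2\exp\left\{-\frac{K_it^2}{2C_g}\right\}
\end{aligned}$$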

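Since $\mathbb{E}[V_i \mid \mathcal{F}_{i-1}] = 0$, we can apply Lemma 26 with $c = 1/(2C_g)$ to yield

$$\mathbb{E}\left[e^{sV_i} \mid \mathcal{F}_{i-1}\right] \leq \exp\left\{\frac{1}{2}(18C_g)s^2\right\}$$

This shows that the collection of random variables $\{V_i\}_{i=1}^{n}$ and the filtration $\{\mathcal{F}_i\}_{i=0}^{n}$ satisfies the conditions of
Lemma 22. Before applying Lemma 22, we bound the conditional expectations

$$\mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| \,\middle|\, \mathcal{F}_{i-1}\right]$$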

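By a straightforward calculation conditioned on $\mathcal{F}_{i-1}$, we have

$$\begin{aligned}
\mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\|^2 \,\middle|\, \mathcal{F}_{i-1}\right]
&= \frac{1}{K_i^2}\sum_{k=1}^{K_i}\sum_{j=1}^{K_i}\mathbb{E}\left[\left\langle \nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i),\, \nabla_x\ell(\tilde x_i, z_i(j)) - \nabla_x f_i(\tilde x_i)\right\rangle \,\middle|\, \mathcal{F}_{i-1}\right] \\
&= \frac{1}{K_i^2}\sum_{k=1}^{K_i}\mathbb{E}\left[\left\|\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right\|^2 \,\middle|\, \mathcal{F}_{i-1}\right] \\
&\overset{(a)}{=} \frac{1}{K_i^2}\sum_{k=1}^{K_i}\sum_{q=1}^{d}\mathbb{E}\left[\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)_q^2 \,\middle|\, \mathcal{F}_{i-1}\right] \\
&\overset{(b)}{\leq} \frac{1}{K_i^2}\,K_i\,d\,\frac{C_g}{d^2} \\
&= \frac{C_g}{dK_i}
\end{aligned}$$

where (a) is a decomposition into each component of the vector and (b) follows since a centered sub-Gaussian random
variable with parameter $C_g/d^2$ satisfies

$$\mathbb{E}\left[\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)_q^2 \,\middle|\, \mathcal{F}_{i-1}\right] \leq \frac{C_g}{d^2}$$

Then by Jensen's inequality

$$\mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| \,\middle|\, \mathcal{F}_{i-1}\right] \leq \sqrt{\frac{C_g}{dK_i}}$$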

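Define the constants

$$a_1 = a_n = \frac{1}{m(n-1)}$$

$$a_2 = \cdots = a_{n-1} = \frac{2}{m(n-1)}$$

resulting in

$$\|a\|_2^2 = \frac{4n-6}{m^2(n-1)^2} \leq \frac{4}{m^2(n-1)}$$

Using the bound in (9) and Lemma 22 from Appendix B with this choice of $a$, it holds that

$$\begin{aligned}
\mathbb{P}\left\{\left|\hat\rho_n^{(2)} - \hat\rho_n^{(3)}\right| > \sum_{i=1}^{n}a_i\sqrt{\frac{C_g}{dK_i}} + t\right\}
&\leq \mathbb{P}\left\{\sum_{i=1}^{n}a_i\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| > \sum_{i=1}^{n}a_i\,\mathbb{E}\left[\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(\tilde x_i, z_i(k)) - \nabla_x f_i(\tilde x_i)\right)\right\| \,\middle|\, \mathcal{F}_{i-1}\right] + t\right\} \\
&= \mathbb{P}\left\{\sum_{i=1}^{n}a_iV_i > t\right\} \\
&\leq \exp\left\{-\frac{(n-1)m^2t^2}{72C_g}\right\}
\end{aligned}$$

Combining this bound with $\hat\rho_n^{(3)} \geq \rho$ yields

$$\begin{aligned}
\sum_{n=2}^{\infty}\mathbb{P}\left\{\hat\rho_n^{(2)} < \rho - \sum_{i=1}^{n}a_i\sqrt{\frac{C_g}{dK_i}} - t_n\right\}
&\leq \sum_{n=2}^{\infty}\mathbb{P}\left\{\hat\rho_n^{(2)} < \hat\rho_n^{(3)} - \sum_{i=1}^{n}a_i\sqrt{\frac{C_g}{dK_i}} - t_n\right\} \\
&\leq \sum_{n=2}^{\infty}\mathbb{P}\left\{\left|\hat\rho_n^{(2)} - \hat\rho_n^{(3)}\right| > \sum_{i=1}^{n}a_i\sqrt{\frac{C_g}{dK_i}} + t_n\right\} \\
&\leq \sum_{n=2}^{\infty}\exp\left\{-\frac{m^2(n-1)t_n^2}{72C_g}\right\} < +\infty
\end{aligned}$$

The result follows from the Borel-Cantelli lemma. Note that as claimed

$$C_n^{(2)} = \sum_{i=1}^{n}a_i\sqrt{\frac{C_g}{dK_i}} = \frac{1}{\sqrt{d}\,m(n-1)}\left(\sqrt{\frac{C_g}{K_1}} + 2\sum_{i=2}^{n-1}\sqrt{\frac{C_g}{K_i}} + \sqrt{\frac{C_g}{K_n}}\right)$$

□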

Next, we show that $\hat\rho_n$ plus a constant upper bounds $\hat\rho_n^{(2)}$ eventually with a general assumption on the optimization algorithm.
When the conditions of Lemmas 2 and 3 are satisfied, it holds that $\hat\rho_n$ plus a constant upper bounds $\rho$.

Lemma 3. Suppose the following conditions hold:

1. B.1-B.2 hold

2. There exist bounds

$$\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right] \leq C(K_i) \quad i = 1, \ldots, n$$

3. The sequence $\{t_n\}$ satisfies

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{(n-1)^2t_n^2}{2n\left(1 + \frac{L}{m}\right)^2\mathrm{diam}^2(\mathcal{X})}\right\} < +\infty$$

Then for all $n$ large enough it holds that $\hat\rho_n + C_n + t_n \geq \hat\rho_n^{(2)}$ almost surely with

$$C_n \triangleq \frac{1 + \frac{L}{m}}{n-1}\left(C(K_1) + 2\sum_{i=2}^{n-1}C(K_i) + C(K_n)\right)$$
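Proof. We have by the triangle inequality, reverse triangle inequality, and the Lipschitz continuity of $\nabla_x\ell(x,z)$ in $x$
from assumption B.1

$$\begin{aligned}
\left|\tilde\rho_i - \tilde\rho_i^{(2)}\right| &\leq \Big|\|x_i - x_{i-1}\| - \|\tilde x_i - \tilde x_{i-1}\|\Big| + \frac{1}{m}\left|\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(x_i, z_i(k))\right\| - \left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\nabla_x\ell(\tilde x_i, z_i(k))\right\|\right| \\
&\quad + \frac{1}{m}\left|\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_x\ell(x_{i-1}, z_{i-1}(k))\right\| - \left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_x\ell(\tilde x_{i-1}, z_{i-1}(k))\right\|\right| \\
&\leq \left\|(x_i - \tilde x_i) - (x_{i-1} - \tilde x_{i-1})\right\| + \frac{1}{m}\left\|\frac{1}{K_i}\sum_{k=1}^{K_i}\left(\nabla_x\ell(x_i, z_i(k)) - \nabla_x\ell(\tilde x_i, z_i(k))\right)\right\| \\
&\quad + \frac{1}{m}\left\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\left(\nabla_x\ell(x_{i-1}, z_{i-1}(k)) - \nabla_x\ell(\tilde x_{i-1}, z_{i-1}(k))\right)\right\| \\
&\leq \left(1 + \frac{L}{m}\right)\left(\|x_i - \tilde x_i\| + \|x_{i-1} - \tilde x_{i-1}\|\right)
\end{aligned}$$

so

$$\left|\hat\rho_n - \hat\rho_n^{(2)}\right| \leq \frac{1}{n-1}\sum_{i=2}^{n}\left|\tilde\rho_i - \tilde\rho_i^{(2)}\right| \leq \frac{1 + \frac{L}{m}}{n-1}\left(\|x_1 - \tilde x_1\| + 2\sum_{i=2}^{n-1}\|x_i - \tilde x_i\| + \|x_n - \tilde x_n\|\right)$$

We will again apply Lemma 22 of Appendix B to analyze this upper bound using the sigma algebra

$$\mathcal{F}_i = \sigma\left(\bigcup_{j=1}^{i}\{z_j(k)\}_{k=1}^{K_j} \cup \bigcup_{j=1}^{i}\{\tilde z_j(k)\}_{k=1}^{K_j}\right) \quad i = 0, \ldots, n \tag{12}$$

Define the random variable

$$V_i = \|x_i - \tilde x_i\| - \mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right]$$

Clearly, $V_i$ is $\mathcal{F}_i$-measurable. Since

$$-\mathrm{diam}(\mathcal{X}) \leq V_i \leq \mathrm{diam}(\mathcal{X}),$$

and $\mathbb{E}[V_i \mid \mathcal{F}_{i-1}] = 0$, we can apply the conditional version of Hoeffding's Lemma from Lemma 23 to yield

$$\mathbb{E}\left[e^{sV_i} \mid \mathcal{F}_{i-1}\right] \leq \exp\left\{\frac{1}{2}\mathrm{diam}^2(\mathcal{X})s^2\right\}$$

The collection of random variables $\{V_i\}_{i=1}^{n}$ and the filtration $\{\mathcal{F}_i\}_{i=0}^{n}$ satisfy the conditions of Lemma 22. Before
applying Lemma 22, we bound the conditional expectations

$$\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right]$$

By assumption, we have

$$\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right] \leq C(K_i) \quad i = 1, \ldots, n$$

and so

$$\begin{aligned}
&\frac{1 + \frac{L}{m}}{n-1}\left(\mathbb{E}\left[\|x_1 - \tilde x_1\| \mid \mathcal{F}_0\right] + 2\sum_{i=2}^{n-1}\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right] + \mathbb{E}\left[\|x_n - \tilde x_n\| \mid \mathcal{F}_{n-1}\right]\right) \\
&\quad\leq \frac{1 + \frac{L}{m}}{n-1}\left(C(K_1) + 2\sum_{i=2}^{n-1}C(K_i) + C(K_n)\right) = C_n
\end{aligned}$$

Set

$$a_1 = a_n = \frac{1 + \frac{L}{m}}{n-1}$$

and

$$a_2 = \cdots = a_{n-1} = \frac{2\left(1 + \frac{L}{m}\right)}{n-1}$$

resulting in

$$\|a\|_2^2 = \frac{(4n-6)\left(1 + \frac{L}{m}\right)^2}{(n-1)^2}$$

Applying our bound on $\left|\hat\rho_n - \hat\rho_n^{(2)}\right|$ and Lemma 22 with this choice of $a$ yields

$$\begin{aligned}
\mathbb{P}\left\{\left|\hat\rho_n - \hat\rho_n^{(2)}\right| > C_n + t\right\}
&\leq \mathbb{P}\Bigg\{\frac{1 + \frac{L}{m}}{n-1}\left(\|x_1 - \tilde x_1\| + 2\sum_{i=2}^{n-1}\|x_i - \tilde x_i\| + \|x_n - \tilde x_n\|\right) \\
&\qquad > \frac{1 + \frac{L}{m}}{n-1}\left(\mathbb{E}\left[\|x_1 - \tilde x_1\| \mid \mathcal{F}_0\right] + 2\sum_{i=2}^{n-1}\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right] + \mathbb{E}\left[\|x_n - \tilde x_n\| \mid \mathcal{F}_{n-1}\right]\right) + t\Bigg\} \\
&\leq \exp\left\{-\frac{(n-1)^2t^2}{2n\left(1 + \frac{L}{m}\right)^2\mathrm{diam}^2(\mathcal{X})}\right\}
\end{aligned}$$

Finally, we have

$$\sum_{n=2}^{\infty}\mathbb{P}\left\{\hat\rho_n < \hat\rho_n^{(2)} - C_n - t_n\right\} \leq \sum_{n=2}^{\infty}\mathbb{P}\left\{\left|\hat\rho_n - \hat\rho_n^{(2)}\right| > C_n + t_n\right\} \leq \sum_{n=2}^{\infty}\exp\left\{-\frac{(n-1)^2t_n^2}{2n\left(1 + \frac{L}{m}\right)^2\mathrm{diam}^2(\mathcal{X})}\right\} < +\infty$$

The claim follows from the Borel-Cantelli Lemma. □

If Lemmas 2 and 3 hold for the sequence $\{t_n/2\}$, then for all $n$ large enough it holds that

$$\hat\rho_n + C_n + C_n^{(2)} + t_n \geq \rho$$

almost surely.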
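Lemma 4. It always holds that

$$\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right] \leq 2\sqrt{\frac{2}{m}b\left(\mathrm{diam}^2(\mathcal{X}), K_i\right)}$$

Therefore, the choice

$$C(K_i) = 2\sqrt{\frac{2}{m}b\left(\mathrm{diam}^2(\mathcal{X}), K_i\right)}$$

satisfies the conditions of Lemma 3.

Proof. Using the sigma algebras defined in (12) yields

$$\begin{aligned}
\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right] &\leq \mathbb{E}\left[\|x_i - x_i^*\| \mid \mathcal{F}_{i-1}\right] + \mathbb{E}\left[\|\tilde x_i - x_i^*\| \mid \mathcal{F}_{i-1}\right] \\
&\leq \mathbb{E}\left[\sqrt{\frac{2}{m}\left(f_i(x_i) - f_i(x_i^*)\right)} \,\middle|\, \mathcal{F}_{i-1}\right] + \mathbb{E}\left[\sqrt{\frac{2}{m}\left(f_i(\tilde x_i) - f_i(x_i^*)\right)} \,\middle|\, \mathcal{F}_{i-1}\right] \\
&\leq \sqrt{\frac{2}{m}\mathbb{E}\left[f_i(x_i) - f_i(x_i^*) \mid \mathcal{F}_{i-1}\right]} + \sqrt{\frac{2}{m}\mathbb{E}\left[f_i(\tilde x_i) - f_i(x_i^*) \mid \mathcal{F}_{i-1}\right]} \\
&\leq 2\sqrt{\frac{2}{m}b\left(\mathrm{diam}^2(\mathcal{X}), K_i\right)}
\end{aligned}$$

where the third inequality follows from Jensen's inequality. □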

This choice of $C(K_n)$ works for any algorithm with the associated $b(d_0, K)$. For any particular algorithm, we believe
that we can produce tighter bounds independent of $\mathrm{diam}(\mathcal{X})$ by copying the Lyapunov analysis used to analyze SGD
as in Appendix A. The analysis becomes algorithm dependent in this case and is omitted.

Finally, we state an overall theorem for the direct estimate that gives general combined conditions under which $\hat\rho_n$
plus a constant upper bounds $\rho$.

Theorem 1. If B.1-B.2 hold and the sequence $\{t_n\}$ satisfies $\sum_{n=2}^{\infty} e^{-Cnt_n^2} < +\infty$ for all $C > 0$, then for a sequence of
constants $\{C_n\}$ and for all $n$ large enough it holds that $\hat\rho_n + C_n + t_n \geq \rho$ almost surely.

Proof. Combine Lemmas 2 and 3 to yield the result, with the constant taken to be the sum of $C_n$ from Lemma 3 and $C_n^{(2)}$ from Lemma 2. □

3.3.2 Vector IPM Estimate 

We first derive a version of Hoeffding's inequality that allows for some dependence among the random variables. We
use this concentration inequality to analyze $\hat\rho_n$ for the IPM estimate. Given an integer $W$, we construct a cover of
$\{1, 2, \ldots, n\}$ by dividing the set into $W$ groups of integers spaced by $W$, i.e.,

$$\mathcal{A}_j = \left\{j,\; j + W,\; j + 2W,\; \ldots,\; j + \left\lfloor\frac{n-j}{W}\right\rfloor W\right\} \quad j = 1, \ldots, W \tag{13}$$

Note that

$$\{1, 2, \ldots, n\} = \bigcup_{j=1}^{W}\mathcal{A}_j$$

and $\mathcal{A}_i \cap \mathcal{A}_j = \emptyset$ for $i \neq j$. The proof of Lemma 5 is nearly identical to the proof of the extension of Hoeffding's
inequality from [?] with Lemma 22 used instead. We assume that if we refer to a filtration $\mathcal{F}_i$ with $i < 0$, then we
implicitly refer to $\mathcal{F}_0$.
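The cover in (13) is easy to build explicitly. The sketch below (the function name `build_cover` is an assumption for this example) returns the $W$ groups and makes it simple to confirm that they partition $\{1, \ldots, n\}$.

```python
def build_cover(n, W):
    """Return the W groups A_1, ..., A_W, where A_j = {j, j+W, j+2W, ...}
    contains every integer in 1..n congruent to j modulo W."""
    return [list(range(j, n + 1, W)) for j in range(1, W + 1)]

def is_partition(groups, n):
    """Check that the groups are pairwise disjoint and cover 1..n."""
    flat = [i for group in groups for i in group]
    return sorted(flat) == list(range(1, n + 1))
```

For instance, `build_cover(7, 3)` gives `[[1, 4, 7], [2, 5], [3, 6]]`: elements within a group are spaced by $W = 3$, which is what lets the dependence between nearby indices be handled group by group.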
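Lemma 5 (Dependent Hoeffding's Inequality). Suppose we are given a collection of random variables $\{V_i\}_{i=1}^{n}$ and a
filtration $\{\mathcal{F}_i\}_{i=0}^{n}$ such that

1. $a_i \leq V_i \leq b_i$ for constants $a_i$ and $b_i$

2. $V_i$ is $\mathcal{F}_i$-measurable, $i = 1, \ldots, n$

3. Given an integer $W$ and a cover $\{\mathcal{A}_j\}_{j=1}^{W}$ as in (13), for each $j$ it holds that

$$\mathbb{E}\left[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\right] = 0 \quad i = 1, \ldots, \left\lfloor\frac{n-j}{W}\right\rfloor$$

and

$$\mathbb{E}\left[V_j \mid \mathcal{F}_0\right] = 0$$

Then it holds that

$$\mathbb{P}\left\{\sum_{i=1}^{n}V_i > t\right\} \leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\}$$

and

$$\mathbb{P}\left\{\sum_{i=1}^{n}V_i < -t\right\} \leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\}$$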
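Proof. Define

$$U_j = \sum_{i=0}^{\lfloor (n-j)/W \rfloor} V_{j+iW}$$

for $j = 1, \ldots, W$. Let $\{p_j\}_{j=1}^{W}$ be a probability distribution on $\{1, \ldots, W\}$ to be specified later. By Jensen's inequality,
we have

$$\exp\left\{s\sum_{i=1}^{n}V_i\right\} = \exp\left\{\sum_{j=1}^{W}p_j\frac{s}{p_j}U_j\right\} \leq \sum_{j=1}^{W}p_j\exp\left\{\frac{s}{p_j}U_j\right\}$$

Then it holds that

$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n}V_i\right\}\right] \leq \sum_{j=1}^{W}p_j\,\mathbb{E}\left[\exp\left\{\frac{s}{p_j}U_j\right\}\right]$$

Now consider one term

$$\mathbb{E}\left[\exp\left\{\frac{s}{p_j}U_j\right\}\right] = \mathbb{E}\left[\exp\left\{\frac{s}{p_j}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}V_{j+iW}\right\}\right]$$

Since $a_{j+iW} \leq V_{j+iW} \leq b_{j+iW}$ and

$$\mathbb{E}\left[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\right] = 0,$$

we can apply the conditional version of Hoeffding's Lemma from Lemma 23 to yield

$$\mathbb{E}\left[\exp\left\{\frac{s}{p_j}V_{j+iW}\right\} \,\middle|\, \mathcal{F}_{j+(i-1)W}\right] \leq \exp\left\{\frac{1}{8}\frac{s^2}{p_j^2}\left(b_{j+iW} - a_{j+iW}\right)^2\right\}$$

Then we can apply Lemma 22 to $\{V_{j+iW}\}$ and $\{\mathcal{F}_{j+iW}\}$ to yield

$$\mathbb{E}\left[\exp\left\{\frac{s}{p_j}U_j\right\}\right] \leq \exp\left\{\frac{1}{8}\frac{s^2}{p_j^2}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}\left(b_{j+iW} - a_{j+iW}\right)^2\right\}$$

Then we have

$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n}V_i\right\}\right] \leq \sum_{j=1}^{W}p_j\exp\left\{\frac{1}{8}\frac{s^2}{p_j^2}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}\left(b_{j+iW} - a_{j+iW}\right)^2\right\} = \sum_{j=1}^{W}p_j\exp\left\{\frac{s^2c_j}{8p_j^2}\right\}$$

with

$$c_j = \sum_{i=0}^{\lfloor (n-j)/W \rfloor}\left(b_{j+iW} - a_{j+iW}\right)^2$$

Let $p_j = \sqrt{c_j}/T$ and

$$T = \sum_{j=1}^{W}\sqrt{c_j}$$

Therefore, we have

$$\mathbb{E}\left[\exp\left\{s\sum_{i=1}^{n}V_i\right\}\right] \leq \exp\left\{\frac{1}{8}s^2T^2\right\}$$

Applying the Chernoff bound [19] and optimizing over $s$ yields

$$\mathbb{P}\left\{\sum_{i=1}^{n}V_i > t\right\} \leq \exp\left\{-\frac{2t^2}{T^2}\right\}$$

Bounding $T$ with Cauchy-Schwarz yields

$$T^2 = \left(\sum_{j=1}^{W}\sqrt{c_j}\right)^2 \leq \left(\sum_{j=1}^{W}1\right)\left(\sum_{j=1}^{W}c_j\right) = W\sum_{i=1}^{n}(b_i - a_i)^2$$

and the result follows. The proof for the other tail is nearly identical.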
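If we do not have condition 3 of Lemma 5, then it holds that

$$\mathbb{P}\left\{\sum_{i=1}^{n}V_i > \sum_{j=1}^{W}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}\mathbb{E}\left[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\right] + t\right\} \leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\}$$

If we can bound the conditional expectation

$$\mathbb{E}\left[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\right] \leq C_{j+iW},$$

by an $\mathcal{F}_{j+(i-1)W}$-measurable random variable, then we have

$$\begin{aligned}
\mathbb{P}\left\{\sum_{i=1}^{n}V_i > \sum_{i=1}^{n}C_i + t\right\}
&\leq \mathbb{P}\left\{\sum_{i=1}^{n}V_i > \sum_{j=1}^{W}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}\mathbb{E}\left[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\right] + t\right\} \\
&= \mathbb{P}\left\{\sum_{j=1}^{W}\sum_{i=0}^{\lfloor (n-j)/W \rfloor}\left(V_{j+iW} - \mathbb{E}\left[V_{j+iW} \mid \mathcal{F}_{j+(i-1)W}\right]\right) > t\right\} \\
&\leq \exp\left\{-\frac{2t^2}{W\sum_{i=1}^{n}(b_i - a_i)^2}\right\}
\end{aligned}$$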
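To get a feel for how the dependence parameter $W$ enters the tail bound $\exp\{-2t^2 / (W\sum_i (b_i - a_i)^2)\}$, the following small sketch evaluates it numerically. The function name and the example values are assumptions for illustration only.

```python
import math

def dependent_hoeffding_bound(t, W, ranges):
    """Evaluate exp(-2 t^2 / (W * sum_i (b_i - a_i)^2)).

    `ranges` is a list of (a_i, b_i) pairs bounding each random variable.
    """
    total = sum((b - a) ** 2 for a, b in ranges)
    return math.exp(-2.0 * t * t / (W * total))

def symmetric_case(n, t, W, diam):
    """Special case where every variable lies in [-diam, diam] and the
    deviation of the sum is n*t."""
    return dependent_hoeffding_bound(n * t, W, [(-diam, diam)] * n)
```

With $W = 1$ the expression reduces to the usual Hoeffding bound; larger $W$ weakens the exponent by a factor of $W$. In the symmetric case with $W = 2$, the bound simplifies to $\exp\{-nt^2 / (4\,\mathrm{diam}^2)\}$.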

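We have the following lemma characterizing the performance of the IPM estimate.
Lemma 6. For the IPM estimate and any sequence $\{t_n\}$ such that

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{nt_n^2}{4\,\mathrm{diam}^2(\mathcal{X})}\right\} < +\infty$$

for all $n$ large enough it holds that $\hat\rho_n + t_n \geq \rho$ almost surely.

Proof. Define the random variables

$$V_i = \tilde\rho_i - \mathbb{E}\left[\tilde\rho_i \mid \mathcal{K}_{i-2}\right]$$

with $\mathcal{K}_i$ defined in (6). We have

$$-\mathrm{diam}(\mathcal{X}) \leq V_i \leq \mathrm{diam}(\mathcal{X})$$

Clearly, $V_i$ is $\mathcal{K}_i$-measurable and $\mathbb{E}[V_i \mid \mathcal{K}_{i-2}] = 0$. Now, we can apply Lemma 5 with $W = 2$ to yield

$$\mathbb{P}\left\{\sum_{i=1}^{n}V_i < -nt\right\} \leq \exp\left\{-\frac{2(nt)^2}{(2)\left(4n\,\mathrm{diam}^2(\mathcal{X})\right)}\right\} = \exp\left\{-\frac{nt^2}{4\,\mathrm{diam}^2(\mathcal{X})}\right\}$$

None of the random variables $\{z_i(k)\}_{k=1}^{K_i}$ and $\{z_{i-1}(k)\}_{k=1}^{K_{i-1}}$ are $\mathcal{K}_{i-2}$-measurable. Also, regardless of how many
samples $K_i$ and $K_{i-1}$ are taken, the IPM estimate is biased upward. Thus, it holds that

$$\mathbb{E}\left[\tilde\rho_i \mid \mathcal{K}_{i-2}\right] \geq \rho$$

Therefore, it follows that

$$\mathbb{P}\left\{\hat\rho_n < \rho - t\right\} \leq \mathbb{P}\left\{\sum_{i=1}^{n}\tilde\rho_i < \sum_{i=1}^{n}\mathbb{E}\left[\tilde\rho_i \mid \mathcal{K}_{i-2}\right] - nt\right\} \leq \exp\left\{-\frac{nt^2}{4\,\mathrm{diam}^2(\mathcal{X})}\right\}$$

Note that we pay a price of two in the exponent due to $\tilde\rho_i$ and $\tilde\rho_{i-1}$ both depending on the samples from $p_{i-1}$. Since

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{nt_n^2}{4\,\mathrm{diam}^2(\mathcal{X})}\right\} < +\infty$$

it follows that

$$\sum_{n=2}^{\infty}\mathbb{P}\left\{\hat\rho_n + t_n < \rho\right\} < +\infty,$$

This in turn guarantees by way of the Borel-Cantelli Lemma that for $n$ large enough

$$\hat\rho_n + t_n \geq \rho$$

almost surely.

□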
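As a worked illustration of the summability condition in Lemma 6, consider one possible choice of the slack sequence. The specific $t_n$ below is an assumption made for this example and is not prescribed by the text; it only shows that a sequence decaying like $\sqrt{\log n / n}$ is admissible.

```latex
t_n = 2\,\mathrm{diam}(\mathcal{X})\sqrt{\frac{2\log n}{n}}
\;\Longrightarrow\;
\frac{n t_n^2}{4\,\mathrm{diam}^2(\mathcal{X})} = 2\log n
\;\Longrightarrow\;
\sum_{n=2}^{\infty}\exp\left\{-\frac{n t_n^2}{4\,\mathrm{diam}^2(\mathcal{X})}\right\}
= \sum_{n=2}^{\infty}\frac{1}{n^2} < +\infty
```

Under this choice $t_n \to 0$, so the slack added to the estimate vanishes as more time steps are observed.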


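3.4 Combining One Step Estimates For Bounded Change

We now look at estimating $\rho$ in the case that

$$\|x_i^* - x_{i-1}^*\| \leq \rho.$$

We set

$$\rho_i \triangleq \|x_i^* - x_{i-1}^*\|$$

B.3 Assume that we have estimators $h_W : \mathbb{R}^W \to \mathbb{R}$ such that

1. $\mathbb{E}[h_W(\rho_j, \ldots, \rho_{j-W+1})] \geq \rho$ for all $j \geq 1$ and $W \geq 1$

2. For any random variables $\{\tilde\rho_i\}$ such that $\mathbb{E}[\tilde\rho_i] \geq \mathbb{E}[\rho_i]$, we have

$$\mathbb{E}\left[h_W(\tilde\rho_j, \ldots, \tilde\rho_{j-W+1})\right] \geq \mathbb{E}\left[h_W(\rho_j, \ldots, \rho_{j-W+1})\right]$$

For example, if $\rho_i \sim \mathrm{Unif}[0, \rho]$, then

$$h_W(\rho_i, \rho_{i+1}, \ldots, \rho_{i+W-1}) = \frac{W+1}{W}\max\{\rho_i, \rho_{i+1}, \ldots, \rho_{i+W-1}\}$$

is an estimator of $\rho$ with the required properties. Also, note that the two conditions on the estimator in B.3 imply that

$$\mathbb{E}\left[h_W(\tilde\rho_j, \ldots, \tilde\rho_{j-W+1})\right] \geq \mathbb{E}\left[h_W(\rho_j, \ldots, \rho_{j-W+1})\right] \geq \rho$$

Given an estimator satisfying assumption B.3, we compute

$$\tilde\rho^{(i)} = h_W(\tilde\rho_i, \tilde\rho_{i-1}, \ldots, \tilde\rho_{i-W+1})$$

and set

$$\hat\rho_n = \frac{1}{n-1}\sum_{i=2}^{n}\tilde\rho^{(i)} = \frac{1}{n-1}\sum_{i=2}^{n}h_{\min\{W, i-1\}}\left(\tilde\rho_i, \tilde\rho_{i-1}, \ldots, \tilde\rho_{\max\{i-W+1, 2\}}\right) \tag{14}$$

We have

$$\mathbb{E}[\hat\rho_n] = \frac{1}{n-1}\sum_{i=2}^{n}\mathbb{E}\left[\tilde\rho^{(i)}\right] \geq \rho$$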
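The windowed estimator can be sketched as follows for the scaled-maximum example. This is a minimal illustration under stated assumptions: the function names are hypothetical, the one step estimates are passed as a plain list whose first entry corresponds to $i = 2$, and the Monte Carlo helper assumes the one step changes are drawn independently from $\mathrm{Unif}[0, \rho]$ as in the example above.

```python
import random

def h_max(window):
    """Scaled maximum (W+1)/W * max(window), with W = len(window)."""
    W = len(window)
    return (W + 1) / W * max(window)

def combined_estimate(one_step, W):
    """Average of windowed estimates over all available time steps.

    one_step[0] is the one step estimate for i = 2, one_step[1] for i = 3,
    and so on. Near the start the window is truncated, so its length is
    min(W, number of estimates seen so far).
    """
    total = 0.0
    for idx in range(len(one_step)):
        lo = max(idx - W + 1, 0)
        total += h_max(one_step[lo:idx + 1])
    return total / len(one_step)

def mc_mean_h_max(rho, W, trials=20000, seed=0):
    """Monte Carlo mean of h_max over windows of iid Unif[0, rho] draws."""
    rng = random.Random(seed)
    draws = (h_max([rng.uniform(0.0, rho) for _ in range(W)]) for _ in range(trials))
    return sum(draws) / trials
```

Since the maximum of $W$ independent $\mathrm{Unif}[0, \rho]$ variables has mean $\frac{W}{W+1}\rho$, the factor $\frac{W+1}{W}$ makes the windowed estimate unbiased in this example, and `mc_mean_h_max(rho, W)` should come out close to `rho`.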

Lemma 7 (IPM Single Step Estimates). For the estimator in (14) computed using the IPM estimate for $\tilde\rho_i$ and any
sequence $\{t_n\}$ such that

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{2(n-1)t_n^2}{(W+1)\,\mathrm{diam}^2(\mathcal{X})}\right\} < +\infty$$

it holds that for all $n$ large enough $\hat\rho_n + t_n \geq \rho$ almost surely.
Proof. We copy the proof of Lemma 6 with $W + 1$ in place of 2 and note that $\tilde\rho^{(i)}$ and $\tilde\rho^{(j)}$ with $|i - j| \geq W + 1$ do not
depend on the same samples. Lemma 5 and some simple algebra yields

$$\mathbb{P}\left\{\hat\rho_n < \rho - t\right\} \leq \exp\left\{-\frac{2(n-1)t^2}{(W+1)\,\mathrm{diam}^2(\mathcal{X})}\right\}$$

We pay a price of $W + 1$ in the denominator of the exponent due to the dependence of the $\tilde\rho^{(i)}$. By the Borel-Cantelli
Lemma, for all $n$ large enough it holds that $\hat\rho_n + t_n \geq \rho$ almost surely as long as

$$\sum_{n=2}^{\infty}\exp\left\{-\frac{2(n-1)t_n^2}{(W+1)\,\mathrm{diam}^2(\mathcal{X})}\right\} < +\infty$$

□

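To analyze the direct estimate, we need the following assumption:
B.4 Suppose that there exist absolute constants $\{b_i\}_{i=1}^{W}$ for any fixed $W$ such that

$$\left|h_W(p_1, \ldots, p_W) - h_W(q_1, \ldots, q_W)\right| \leq \sum_{i=1}^{W}b_i|p_i - q_i| \quad \forall p, q \in \mathbb{R}_{\geq 0}^{W}$$

For the uniform case, we have

$$\left|\frac{W+1}{W}\max\{p_1, \ldots, p_W\} - \frac{W+1}{W}\max\{q_1, \ldots, q_W\}\right| \leq \frac{W+1}{W}\max\{|p_1 - q_1|, \ldots, |p_W - q_W|\} \leq \frac{W+1}{W}\sum_{i=1}^{W}|p_i - q_i|$$

so

$$b_1 = \cdots = b_W = \frac{W+1}{W}$$

Under assumption B.4, we can then show that

$$\hat\rho_n = \frac{1}{n-W}\sum_{i=W+1}^{n}\tilde\rho^{(i)}$$

eventually upper bounds $\rho$ by copying the proofs of the lemmas behind Theorem 1.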
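Lemma 8 (Direct Single Step Estimates). Suppose that the following conditions hold:

1. B.1-B.2 hold

2. The sequence $\{t_n\}$ satisfies

$$\sum_{n=W+1}^{\infty}\exp\left\{-\frac{(n-W)^2t_n^2}{32n\left(1 + \frac{L}{m}\right)^2\left(\sum_{j=1}^{W}b_j\right)^2\mathrm{diam}^2(\mathcal{X})}\right\} < +\infty$$

and

$$\sum_{n=W+1}^{\infty}\exp\left\{-\frac{(n-W)^2m^2t_n^2}{144nC_g\left(\sum_{j=1}^{W}b_j\right)^2}\right\} < +\infty$$

3. There are bounds $C(K)$ such that

$$\mathbb{E}\left[\|x_i - \tilde x_i\| \mid \mathcal{F}_{i-1}\right] \leq C(K_i)$$

Then for all $n$ large enough it holds that $\hat\rho_n + U_n + V_n + t_n \geq \rho$ almost surely with

$$U_n = \frac{2\left(1 + \frac{L}{m}\right)\sum_{j=1}^{W}b_j}{n-W}\sum_{i=1}^{n}C(K_i)$$

and

$$V_n = \frac{2\sum_{j=1}^{W}b_j}{m(n-W)}\sum_{i=1}^{n}\sqrt{\frac{C_g}{dK_i}}$$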


Proof. Define pj;^\ pp\ pf'\ and p\^^ as in Lemmas |2] and [3 First, we have 


t(3) 


|pn-pi^'l < 


1 


< 


< 


< 


n-W 

1 

n-W 

1 

n-W 


E ip''’-P 3 


Ml 


/=W +1 

n i 


E E ^iP;-P 


(3) I 


i=w+l jW-W +1 

n i 




n-W 


E E ^t(lP;-pfl + lpf-pfl) 

i'=W+l ;=i-W+l 

'""'E(lA-Pf’l + IPf’-A®l) 


Second, define 
and 

Then we have 




Ui = \]xi-Xi\ 


Y E a^(-{Xi,Zi{k)) -Vfi{Xi)) 

k=\ 


\Pi-pP\ < ||a;,'-S,'||+- 


Ki 


T 7 E {'^=ciixi,Zi{k))-Vx£{xi,Zi{k))) 

A/ 


yt=l 


< 1 + - (t/, + t/;_i) 


and 


Then it follows that 


|pP)-p(3)| < + 


/ , y'3' h ■ " 

|p„-#| < 


n-W 


< 


E(iA-A®i+ipf’-pPi) 

n 

Et/<+ 




n-W 


i=l 


i{n - W) ,t1 


E^ 


Suppose that 


and 




n-W 


i=\ 


2Y^ b- ” 
m[n — W) 
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Then it holds that 


P 


11 Ai ~ P« ^1 > U„-\-Vn + t'^ 


< P< 






< P 






n-W 


'£Ui>Un 


+p< 


/=1 




We can apply Lemma|22]to each term to yield 


n-W 


^ t/,- > l7„ + - ^ < exp <1 - 


(n — Wft 


2t2 


i=l 


and 


Then it holds that 




32«(l + i)^(l^i^) diam2(jr)^ 

(n - Wfn?t'^ 


m{n — W) 

— Pn ^1 > Un + Vn +t'^ 
< exp ^ — 


144nC 






(n - W)h 


2,2 


32n(l + i)'(l7=iP;) diam2(jr) 
We have by straightforward computation 

2(1 + ^) 17 . 1 ^;^ 


+ exp < - - 


(n - Wfrn^t^ 


144nC 


(rl'L.i.,) J 


Un = 


n-W 




i=l 


and 

Then it holds that 

£ p{p„<p-i7„-y„-f„} 


. ^ 21 ^ « ^ 
" m(n — W) V dKi 


n=w+\ 


< £ p{p„<#-i7„-y„-f„} 

n=W+\ ^ ^ 

^ ^ \\Pn —p^\> Un+Vn + tr}^ 

{n-Wftl 


< 


n=’W+\ 


< ^ exp 

n=W+\ 


32n(l + ^)'(l7=iP;) diam2(jr)^ 


E exp<-- 


(n - iy) 2 m 2 f 2 


«=W +1 


144wC 


( 17 ..‘j)' 


<7 00 


By the Borel-Cantelli lemma, it follows that for all $n$ large enough

$$\hat{\rho}_n + U_n + V_n + t_n \ge \rho$$

almost surely.

3.5 Parameter Estimation

We may need to estimate parameters of the functions $\{f_n\}$, such as the strong convexity parameter $m$, to compute
$b(d_0, K)$. We need the following assumptions on our bound:

D.1 Suppose that our bound $b(d_0, K, \psi)$ is parameterized by $\psi$, which depends on properties of the function $\ell(x,z)$
and the distributions $\{p_n\}_{n=1}^{\infty}$. Suppose that

$$\psi_1 \le \psi_2 \;\Rightarrow\; b(d_0, K, \psi_1) \le b(d_0, K, \psi_2)$$

D.2 There exists a true set of parameters \j/* such that 


= \ff* Vn > 1 


D.3 The spaces $\mathcal{X}$ and $\mathcal{Z}$ are compact
D.4 There exists a constant L such that 


\W^i{x,z)-V^i{x,z)\\ <L\\x-x\\ 


D.5 Suppose that we know that the parameters xj/ £ with compact 
D.6 Suppose that V/„ {x„) has Lipschitz continuous gradients with modulus M 

As a consequence of Assumption D.3, it follows that there exists a constant $G$ such that

$$\|\nabla_x \ell(x,z)\| \le G \quad \forall x \in \mathcal{X},\ z \in \mathcal{Z}$$

Satisfying Assumption D.5 is usually easy due to the compactness assumptions in Assumption D.3.
In most cases, we have

$$\psi = \begin{bmatrix} -m \\ M \\ A \\ B \end{bmatrix}$$

where $m$ is the parameter of strong convexity, $M$ is the Lipschitz gradient modulus, and the pair $(A,B)$ controls gradient
growth, i.e.,

$$\mathbb{E}\|\nabla_x \ell(x,z)\|^2 \le A + B\|x - x^*\|^2$$

We parameterize using $-m$, since a smaller $m$ increases the bound $b(d_0,K)$. We present several general methods for
estimating these parameters, although in practice, problem-specific estimators based on the form of the function may
offer better performance. As an example, we present problem-specific estimates for the penalized quadratic loss

$$\ell(x,z) = \frac{1}{2}\left(y - w^\top x\right)^2 + \frac{1}{2}\lambda\|x\|^2$$


As in estimating $\rho$, we produce one time instant estimates $\tilde{m}_i$, $\tilde{M}_i$, $\tilde{A}_i$, and $\tilde{B}_i$ at time $i$ and combine them. We only
examine the case under Assumption D.4, although we could examine inequality constraints as with estimating $\rho$.
We combine estimates by averaging to yield

1. $\hat{m}_n = \frac{1}{n}\sum_{i=1}^{n} \tilde{m}_i$

2. $\hat{M}_n = \frac{1}{n}\sum_{i=1}^{n} \tilde{M}_i$

3. $\hat{A}_n = \frac{1}{n}\sum_{i=1}^{n} \tilde{A}_i$

4. $\hat{B}_n = \frac{1}{n}\sum_{i=1}^{n} \tilde{B}_i$

3.5.1 Estimating Strong Convexity Parameter and Lipschitz Gradient Modulus

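We seek one step estimators $\tilde{m}_n$ and $\tilde{M}_n$ such that

$$\mathbb{E}[\tilde{m}_n \mid \mathcal{K}_{n-1}] \le m$$

and

$$\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}] \ge M$$

with the conditioning $\sigma$-algebra $\mathcal{K}_{n-1}$ as defined earlier.

Hessian Method: We exploit the fact that

$$\nabla^2_{xx} f_n(x) \succeq mI \quad \forall x \in \mathcal{X}$$

This in turn implies that

$$\lambda_{\min}\left(\nabla^2_{xx} f_n(x)\right) \ge m \quad \forall x \in \mathcal{X}$$

This suggests that given $\{z_n(k)\}_{k=1}^{K_n}$ we set

$$\tilde{m}_n = \min_{x \in \mathcal{X}} \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x, z_n(k))\right)$$

Since

$$\lambda_{\min}(A) = \min_{v : \|v\| = 1} \langle Av, v \rangle,$$

$\lambda_{\min}(A)$ is a concave function of $A$. Then by Jensen's inequality, we have

$$\mathbb{E}[\tilde{m}_n \mid \mathcal{K}_{n-1}] = \mathbb{E}\left[\min_{x \in \mathcal{X}} \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x, z_n(k))\right) \,\Big|\, \mathcal{K}_{n-1}\right]$$

$$\le \min_{x \in \mathcal{X}} \mathbb{E}\left[\lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x, z_n(k))\right) \,\Big|\, \mathcal{K}_{n-1}\right]$$

$$\le \min_{x \in \mathcal{X}} \lambda_{\min}\left(\mathbb{E}\left[\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x, z_n(k)) \,\Big|\, \mathcal{K}_{n-1}\right]\right)$$

$$= \min_{x \in \mathcal{X}} \lambda_{\min}\left(\nabla^2_{xx} f_n(x)\right)$$

Similarly, we can set

$$\tilde{M}_n = \max_{x \in \mathcal{X}} \lambda_{\max}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x, z_n(k))\right)$$

Since

$$\lambda_{\max}(A) = \max_{v : \|v\| = 1} \langle Av, v \rangle,$$

$\lambda_{\max}(A)$ is a convex function of $A$. By Jensen's inequality, it holds that

$$\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}] \ge M$$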

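Gradient Method To Compute $\tilde{m}_n$: To actually minimize over $x$, we can use gradient descent. To apply gradient
descent, we use eigenvalue perturbation results [?]. Suppose that we have a base matrix $T_0$ with eigenvectors $v_{0i}$ and
eigenvalues $\lambda_{0i}$. We want to find the eigenvectors $v_i$ and eigenvalues $\lambda_i$ of a perturbed matrix $T$:

$$T_0 v_{0i} = \lambda_{0i} v_{0i}$$

$$T v_i = \lambda_i v_i$$

In particular, we want to relate $\lambda_{0i}$ to $\lambda_i$. With

$$\delta T = T - T_0,$$

we have

$$\delta\lambda_i = v_{0i}^\top (\delta T)\, v_{0i}$$

and

$$\frac{\partial \lambda_i}{\partial T_{kl}} = v_{0i}(k)\, v_{0i}(l)\, (2 - \delta_{kl})$$

Suppose we are given a matrix-valued function $T(x)$ with

$$T(x) v(x) = \lambda_{\min}(x) v(x)$$

Then it holds that

$$\nabla_x \lambda_{\min}(T(x)) = \sum_{i,j} \frac{\partial \lambda_{\min}}{\partial T_{ij}} \nabla_x T_{ij}(x) = \sum_{i,j} v_i(x)\, v_j(x)\, (2 - \delta_{ij})\, \nabla_x T_{ij}(x)$$

Then we can use gradient descent to solve

$$\min_{x \in \mathcal{X}} \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x, z_n(k))\right)$$

Starting from any $x(0)$, we can compute

$$x(p) = \Pi_{\mathcal{X}}\left[x(p-1) - \mu \nabla_x \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x(p-1), z_n(k))\right)\right], \quad p = 1, \ldots, P$$

and set

$$\tilde{m}_n = \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla^2_{xx} \ell(x(P), z_n(k))\right) \qquad (15)$$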
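The first-order eigenvalue perturbation relation used above can be checked numerically. The sketch below is an illustration only, not part of the original method: the matrix sizes, the random seed, and all variable names are assumptions. It compares the prediction $v_0^\top (\delta T) v_0$ with the actual change in the smallest eigenvalue, and also evaluates the entrywise form under the assumption that the weight $2 - \delta_{ij}$ is meant to count each symmetric off-diagonal pair once (sum over $i \le j$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric base matrix T0 with well-separated eigenvalues 1, 2, 3, 4 (illustrative choice).
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
T0 = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T
T0 = 0.5 * (T0 + T0.T)  # remove rounding asymmetry

# Small symmetric perturbation dT = T - T0.
E = rng.standard_normal((4, 4))
dT = 1e-6 * (E + E.T)

eigvals0, eigvecs0 = np.linalg.eigh(T0)   # eigenvalues in ascending order
v0 = eigvecs0[:, 0]                       # eigenvector of the smallest eigenvalue
lam_min_0 = eigvals0[0]
lam_min_1 = np.linalg.eigvalsh(T0 + dT)[0]

# First-order prediction: d(lambda_min) = v0^T (dT) v0.
predicted = v0 @ dT @ v0

# Same prediction written entrywise over the upper triangle (i <= j),
# with weight 2 - delta_ij for the symmetric off-diagonal pairs.
rows, cols = np.triu_indices(4)
weights = np.where(rows == cols, 1.0, 2.0)
predicted_entrywise = np.sum(weights * v0[rows] * v0[cols] * dT[rows, cols])

actual = lam_min_1 - lam_min_0
print(predicted, predicted_entrywise, actual)
```

The two predictions agree exactly up to rounding, and both differ from the actual change only by a second-order term in the size of the perturbation.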


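Heuristic Method: For any two points $x$ and $y$, we have by strong convexity

$$f_n(y) \ge f_n(x) + \langle \nabla f_n(x), y - x \rangle + \frac{1}{2} m \|y - x\|^2$$

Suppose that we have $N$ points $x(1), \ldots, x(N)$. Then we know that for any two distinct points $x(i)$ and $x(j)$

$$m \le \frac{f_n(x(i)) - f_n(x(j)) - \langle \nabla f_n(x(j)), x(i) - x(j) \rangle}{\frac{1}{2}\|x(i) - x(j)\|^2}$$

This suggests the estimator

$$\tilde{m}_n = \min_{i \ne j} \frac{\frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(i), z_n(k)) - \frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(j), z_n(k)) - \left\langle \frac{1}{K_n}\sum_{k=1}^{K_n} \nabla_x \ell(x(j), z_n(k)),\, x(i) - x(j) \right\rangle}{\frac{1}{2}\|x(i) - x(j)\|^2} \qquad (16)$$

for the strong convexity parameter. Then we have

$$\mathbb{E}[\tilde{m}_n] = \mathbb{E}\left[\min_{i \ne j} \frac{\frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(i), z_n(k)) - \frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(j), z_n(k)) - \left\langle \frac{1}{K_n}\sum_{k=1}^{K_n} \nabla_x \ell(x(j), z_n(k)),\, x(i) - x(j) \right\rangle}{\frac{1}{2}\|x(i) - x(j)\|^2}\right]$$

$$\le \min_{i \ne j} \mathbb{E}\left[\frac{\frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(i), z_n(k)) - \frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(j), z_n(k)) - \left\langle \frac{1}{K_n}\sum_{k=1}^{K_n} \nabla_x \ell(x(j), z_n(k)),\, x(i) - x(j) \right\rangle}{\frac{1}{2}\|x(i) - x(j)\|^2}\right]$$

$$\le \min_{i \ne j} \frac{f_n(x(i)) - f_n(x(j)) - \langle \nabla f_n(x(j)), x(i) - x(j) \rangle}{\frac{1}{2}\|x(i) - x(j)\|^2}$$

It is difficult to compare this estimator to $m$ exactly. All we can say is that

$$m \le \min_{i \ne j} \frac{f_n(x(i)) - f_n(x(j)) - \langle \nabla f_n(x(j)), x(i) - x(j) \rangle}{\frac{1}{2}\|x(i) - x(j)\|^2}$$

as well. In practice, this method produces estimates close to $m$.
Similarly, we can set

$$\tilde{M}_n = \max_{i \ne j} \frac{\frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(i), z_n(k)) - \frac{1}{K_n}\sum_{k=1}^{K_n} \ell(x(j), z_n(k)) - \left\langle \frac{1}{K_n}\sum_{k=1}^{K_n} \nabla_x \ell(x(j), z_n(k)),\, x(i) - x(j) \right\rangle}{\frac{1}{2}\|x(i) - x(j)\|^2} \qquad (17)$$

Problem Specific: For the penalized quadratic, we have

$$\nabla^2_{xx} \ell(x,z) = \lambda I + w w^\top$$

so

$$\nabla^2_{xx} f_n(x) = \lambda I + \mathbb{E}[w_n w_n^\top]$$

This suggests the simple closed-form estimates

$$\tilde{m}_n = \lambda + \lambda_{\min}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} w_n(k) w_n(k)^\top\right)$$

and

$$\tilde{M}_n = \lambda + \lambda_{\max}\left(\frac{1}{K_n}\sum_{k=1}^{K_n} w_n(k) w_n(k)^\top\right)$$

Again, by Jensen's inequality, it holds that

$$\mathbb{E}[\tilde{m}_n \mid \mathcal{K}_{n-1}] \le m$$

and

$$\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}] \ge M$$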
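For the penalized quadratic loss, both the closed-form estimates and the heuristic pairwise estimates can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the function names, the synthetic data, and the choice of random evaluation points are hypothetical and not taken from the paper. For a quadratic loss the heuristic quotient is a Rayleigh quotient of the empirical Hessian, so the heuristic values should fall between the two closed-form values.

```python
import numpy as np

def closed_form_estimates(W, lam):
    """Closed-form (m, M) estimates for the penalized quadratic.

    W is a (K, d) array whose rows are the feature vectors w_n(k).
    """
    S = W.T @ W / W.shape[0]            # (1/K) sum_k w w^T
    eig = np.linalg.eigvalsh(S)         # ascending eigenvalues
    return lam + eig[0], lam + eig[-1]

def empirical_loss_and_grad(x, W, y, lam):
    """Sample-average penalized quadratic loss and its gradient at x."""
    r = W @ x - y
    loss = 0.5 * np.mean(r ** 2) + 0.5 * lam * (x @ x)
    grad = W.T @ r / len(y) + lam * x
    return loss, grad

def heuristic_estimates(points, W, y, lam):
    """Min and max of the pairwise strong-convexity quotient over distinct points."""
    quotients = []
    for i, xi in enumerate(points):
        fi, _ = empirical_loss_and_grad(xi, W, y, lam)
        for j, xj in enumerate(points):
            if i == j:
                continue
            fj, gj = empirical_loss_and_grad(xj, W, y, lam)
            diff = xi - xj
            quotients.append((fi - fj - gj @ diff) / (0.5 * (diff @ diff)))
    return min(quotients), max(quotients)

# Illustrative synthetic data (assumed, not from the paper).
rng = np.random.default_rng(1)
W = rng.standard_normal((200, 3))
y = rng.standard_normal(200)
lam = 0.1
points = [rng.standard_normal(3) for _ in range(5)]

m_closed, M_closed = closed_form_estimates(W, lam)
m_heur, M_heur = heuristic_estimates(points, W, y, lam)
print(m_closed, m_heur, M_heur, M_closed)
```

With only a handful of points the heuristic values are typically strictly inside the interval given by the closed-form values; adding more points tightens them toward the extreme eigenvalues.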

Combining Estimates: We now look at combining the single time instant estimates of the strong convexity parameter
and the Lipschitz gradient modulus.

Lemma 9. Choose $t_n$ such that for all $C > 0$ it holds that

$$\sum_{n=1}^{\infty} \exp\{-C n t_n^2\} < +\infty$$

Then for all $n$ large enough it holds that

1. $\hat{m}_n - t_n \le m$

2. $\hat{M}_n + t_n \ge M$

almost surely.

Proof. By the compactness of the space containing y/, we can apply the dependent version of Hoeffding’s lemma 
(Lemma l2Jt to yield 



and 



for some constants $\sigma_m$ and $\sigma_M$ derived from Hoeffding's lemma. Then applying Lemma 22, it follows that



We know that 


- I > m 

^ .—1 


i=l 


so it follows that 



Similarly, for the Lipschitz gradient modulus, it holds that 


P {M„ < M — } < exp 



As before, we have 



and 



to ensure that almost surely for all $n$ large enough it holds that

$$\hat{m}_n - t_n \le m$$

and

$$\hat{M}_n + t_n \ge M$$

□

For Lemma 9, we need $t_n$ to decay no faster than $1/\sqrt{n}$.

3.5.2 Estimating Gradient Parameters

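From Assumption D.6, it holds that

$$\mathbb{E}\|\nabla_x \ell(x,z)\|^2 = \mathbb{E}\|\nabla_x \ell(x^*,z) + (\nabla_x \ell(x,z) - \nabla_x \ell(x^*,z))\|^2$$

$$\le 2\mathbb{E}\|\nabla_x \ell(x^*,z)\|^2 + 2\mathbb{E}\|\nabla_x \ell(x,z) - \nabla_x \ell(x^*,z)\|^2$$

$$\le 2\mathbb{E}\|\nabla_x \ell(x^*,z)\|^2 + 2M^2\|x - x^*\|^2$$

Thus, we can set

$$A = 2\mathbb{E}\|\nabla_x \ell(x^*,z)\|^2$$

and

$$B = 2M^2$$

This suggests that given an estimate $\tilde{M}_n$ for $M$, we set

$$\tilde{B}_n = 2\tilde{M}_n^2$$

Then by Jensen's inequality, we have

$$\mathbb{E}[\tilde{B}_n \mid \mathcal{K}_{n-1}] = 2\,\mathbb{E}[\tilde{M}_n^2 \mid \mathcal{K}_{n-1}] \ge 2\left(\mathbb{E}[\tilde{M}_n \mid \mathcal{K}_{n-1}]\right)^2 \ge 2M^2 = B$$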

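Lemma 10. Choose $t_n$ such that for all $C > 0$ it holds that

$$\sum_{n=1}^{\infty} \exp\{-C n t_n^2\} < +\infty$$

Then for all $n$ large enough it holds that

$$\hat{B}_n + t_n \ge B$$

almost surely.

Proof. By reasoning identical to that for the strong convexity parameter and the Lipschitz gradient modulus, it holds that

$$P\{\hat{B}_n < B - t_n\} \le \exp\left\{-\frac{n t_n^2}{2\sigma_B^2}\right\}$$

Since we have

$$\sum_{n=1}^{\infty} \exp\left\{-\frac{n t_n^2}{2\sigma_B^2}\right\} < +\infty,$$

for all $n$ large enough it holds that

$$\hat{B}_n + t_n \ge B$$

almost surely. □

To estimate $A$, consider using a point $x$ to approximate $x^*$. It holds that

$$\mathbb{E}\|\nabla_x \ell(x^*,z)\|^2 = \mathbb{E}\|\nabla_x \ell(x,z) + (\nabla_x \ell(x^*,z) - \nabla_x \ell(x,z))\|^2$$

$$\le 2\mathbb{E}\|\nabla_x \ell(x,z)\|^2 + 2\mathbb{E}\|\nabla_x \ell(x^*,z) - \nabla_x \ell(x,z)\|^2$$

$$\le 2\mathbb{E}\|\nabla_x \ell(x,z)\|^2 + 2M^2\,\mathbb{E}\|x - x^*\|^2$$

$$\le 2\mathbb{E}\|\nabla_x \ell(x,z)\|^2 + \frac{2M^2}{m^2}\|\nabla f(x)\|^2$$

$$\le 2\mathbb{E}\|\nabla_x \ell(x,z)\|^2 + 2\left(\frac{M}{m}\right)^2\|\nabla f(x)\|^2$$

This suggests the estimate

$$\tilde{A}_n(x) = \frac{2}{K_n}\sum_{k=1}^{K_n} \|\nabla_x \ell(x, z_n(k))\|^2 + 4\left(\frac{\hat{M}_{n-1} + t_{n-1}}{\hat{m}_{n-1} - t_{n-1}}\right)^2 \left\|\frac{1}{K_n}\sum_{k=1}^{K_n} \nabla_x \ell(x, z_n(k))\right\|^2$$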
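As a rough illustration of the gradient-growth estimates in this subsection, the sketch below computes $B = 2M^2$ from an estimate of $M$, and an estimate of $A$ from per-sample gradients at a point $x$. It follows one reading of the estimate above (coefficient 2 on the mean squared gradient norm and 4 on the squared norm of the mean gradient, with the inflated ratio $(M + t)/(m - t)$); the function name and argument layout are assumptions made for this example.

```python
import numpy as np

def gradient_growth_estimates(grads, M_hat, m_hat, t):
    """Return (A_tilde, B_tilde) from per-sample gradients.

    grads : (K, d) array, row k is the gradient of the loss at x for sample k.
    M_hat, m_hat : current estimates of the Lipschitz modulus and strong convexity.
    t : slack added to M_hat and subtracted from m_hat.
    """
    if m_hat - t <= 0:
        raise ValueError("m_hat - t must be positive")
    B_tilde = 2.0 * M_hat ** 2
    mean_sq_norm = np.mean(np.sum(grads ** 2, axis=1))      # (1/K) sum_k ||g_k||^2
    sq_norm_mean = np.sum(np.mean(grads, axis=0) ** 2)      # ||(1/K) sum_k g_k||^2
    ratio = (M_hat + t) / (m_hat - t)
    A_tilde = 2.0 * mean_sq_norm + 4.0 * ratio ** 2 * sq_norm_mean
    return A_tilde, B_tilde

# Tiny worked example with two unit gradients.
example_grads = np.array([[1.0, 0.0], [0.0, 1.0]])
A_example, B_example = gradient_growth_estimates(example_grads, M_hat=2.0, m_hat=1.0, t=0.0)
print(A_example, B_example)
```

In the worked example the mean squared norm is 1 and the squared norm of the mean is 0.5, so the first term contributes 2 and the second contributes 4 · 4 · 0.5 = 8.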


Lemma 11. For any $x$ possibly random but not a function of $\{z_n(k)\}_{k=1}^{K_n}$ and all $n$ large enough, it holds that

$$\mathbb{E}[\tilde{A}_n \mid \mathcal{K}_{n-1}] \ge A$$

Proof. For any x possibly random but not a function of {zn (^)}fT[, it holds that 


E[An I X-l] 


= E 


= E 


J- E \\'^^£{x,z„{k))f+4- ( 

\ 


1 “t” in— 1 
V 1 in— 1 


Y^^^£{x,Zn{k)) 


k=\ 


X-1 


K, 


E a,£{x,Zn{k))\Y 


« i:=l 


X 


n—1 


+ 4 


Mn— 1 4” 1 

ihfi—\ in—\ 


E 


1 Kn 


K, 


^ Vx£{x,z„{k)) 


« t:=l 


X-1 


^ 2 

\\Vf„{x)\Y 

\mn-l -in-\ J 


The last inequality uses Jensen’s inequality. Then by our prior analysis, almost surely for all n sufficiently large it 
holds that 


Mn- \ +in-\ ^ ^ 
mn-\-in-\ ~ m 


and so for all n sufficiently large 


M 


IE[A„|X-i] > 2E||V,f(a^,2„)f+4(^-j \\^fn(x)f 

= 2E||V,f(at:,^„)f 
= A 

Therefore, for all n sufficiently large (dependent on estimation of m and M), it holds that 

E[A„|X-i]>A 


□ 

Combining Estimates for A: In practice, we use $\tilde{A}_n(x_n)$, which complicates the analysis due to the fact that $x_n$ is
computed using the same samples $\{z_n(k)\}_{k=1}^{K_n}$.

Lemma 12. Choose $t_n$ such that for all $C > 0$ it holds that

$$\sum_{n=1}^{\infty} \exp\{-C n t_n^2\} < +\infty$$

Then for all $n$ large enough it holds that

$$\hat{A}_n + t_n \ge A$$

almost surely.
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Proof. Consider the following three estimates of A all computed with knowledge of m and M and x„ as in Lemma|2| 


^(2) _ 1 


ti V'«/ 




' k= 

2 


1 


K, 


k=l 


K, 


= 2E||V,£(*,-;2,)f+ 4(^^) ||V/;-(a 


' A:=l 
2 


Define the averaged estimates 


We always have 


4(2) 

/\n 


1 ^ .( 2 ) 


= 

n 


1=1 


(3) _ 1^4(3) 


= -L^i 

1 ^(4) 

1=1 


A 

/in 

Aif> ^ iE4| 


aI"^^ > A 


i(2) ic 


> A 


.(3) 


First, we show that A„ ’ is close to A„ . We have 

|AP)-Ap)| 


<2 


Ki 


k=\ 


+ 4 
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yielding 


Second, we have 
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Combining both inequalities, we know that 
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The first and third terms in this bound can be controlled by the analysis of the direct estimate and the second term by 
Lemma (l2^ . This shows that 
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almost surely for all n large enough, it holds that 


In addition, we have 


There exists a random variable JV such that 


Then for n>N,il holds that 




VP + -Y,-^+2t„>A 

nt'iVKi 


- r, A/n + t„ M 

n> N => -> — 

m„-t„ m 


A 

A « 

= -E 

4 IV -1 

>-E 


Mi-1 +ti-i 
ihi^l - f,_i 

M,_i +t,'-i 

'«i-i -h-i 


m / 


YY.^v{xi,zi{k)) 

k=l 

^'£v^e(xi,Zi{k)) 

k=l 



























Since our choice of $t_n$ can decay only as fast as $C/\sqrt{n}$, it follows that
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k=i 
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-t„<0 


for all n large enough. This implies that 



>An- 




/ Mj-i +fi-i 
Vm,_i +f,_i 


^'^v^e{xi,zi{k)) 


k=l 



> A 



for n large enough. 

Using these estimates, we have constructed estimates such that for all n large enough it holds that 

V4 + Cn + ^ Y* 

for appropriate constants C„ almost surely. Therefore, by assumption for all n large enough it holds that 

b{do,K, xj/*) < b{do,K, \i/„ + t„) 


□ 


3.5.3 Effect on p Estimation 


Our analysis of estimating $\rho$ assumes that we know the parameters of the function and in particular the strong convexity
parameter $m$. We now argue that the effect of using estimated parameters instead is minimal. This happens because
we know that for all $n$ large enough it holds that

$$\psi_n \ge \psi^*$$

almost surely.


Lemma 13. We want to estimate a non-negative parameter (j)* by producing a sequence of estimates (pi for all i > 1 
and averaging to produce 

1 ” 

“ E 

where the estimates (pi are dependent on an auxiliary sequence \j/i in the sense that (pfxj/i). Suppose that the following 
conditions hold: 


1. Suppose that there exists a random variable ff such that n>N implies that \(r„ > \f/* 

2. E[^i{r)] > r 


Then it follows that 


1 " 

liminfE - V (pi 

fl 


> i>* 





Proof. It holds that 


Therefore, it follows that 



1 " 

liminfE - V 0,- 

n-^oo Yi 


> liminfE 

n^oo 


> 0* 


- L 


( 18 ) 


□ 

We can extend all the concentration inequalities for estimating $\rho$ as well by extending the inequality in (18) to
yield
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i=\ 


/=1 


ytuwD+oii) 
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Before, we have analyzed 


fLMwn 


so for large enough n, we recover previous results, since the o (1) term goes to 0. 


4 Adaptive Sequential Optimization With $\rho$ Unknown

We now examine the case with $\rho$ unknown. We extend the work of Section 2 using the estimates of $\rho$ in Section 3.
Our analysis depends on the following crucial assumptions:

C.1 For appropriate sequences $\{t_n\}$, for all $n$ sufficiently large it holds that $\hat{\rho}_n + t_n \ge \rho$ almost surely.

C.2 $b(d_0, K_n)$ factors as $b(d_0, K_n) = \alpha(K_n) d_0 + \beta(K_n)$

We have demonstrated that assumption C.1 holds for the direct and IPM estimates of $\rho$ under (2) and (3). Note
that whether we assume (2) or (3) does not matter for the analysis.

4.1 General Condition on $K_n$

We start with a general result showing that for any choice of $K_n$ such that $K_n \ge K^*$ for all $n$ large enough the excess
risk is controlled in the sense that

$$\limsup_{n \to \infty} \left(\mathbb{E}[f_n(x_n)] - f_n(x_n^*)\right) \le \epsilon$$

We then apply this result to two different selection rules for $K_n$.

Consider the function

$$\phi_K(v) = \alpha(K)\left(\sqrt{\frac{2}{m} v} + \rho\right)^2 + \beta(K)$$

derived from assumption C.2. Note that as a function of $v$, $\phi_K(v)$ is clearly increasing and strictly concave. First,
suppose that we select $K^*$ defined in (5). Then by definition it holds that

$$\phi_{K^*}(\epsilon) \le \epsilon$$

We study fixed points of the function $\phi_{K^*}(v)$:

Lemma 14. The function $\phi_{K^*}(v)$ has a unique positive fixed point $\bar{v}$ with

1. $\bar{v} = \phi_{K^*}(\bar{v}) \le \epsilon$

2. $\phi'_{K^*}(\bar{v}) < 1$

Proof. We have

$$\phi_{K^*}(0) = \alpha(K^*)\rho^2 + \beta(K^*) > 0$$

Since

$$\lim_{v \to 0} \phi_{K^*}(v) = \phi_{K^*}(0)$$

and $\phi_{K^*}(0) > 0$, there exists a positive $a$ sufficiently small that

$$\phi_{K^*}(a) > a$$

Next, expanding $\phi_K(v)$ yields

$$\phi_K(v) = \frac{2}{m}\alpha(K) v + 2\alpha(K)\rho\sqrt{\frac{2}{m} v} + \alpha(K)\rho^2 + \beta(K)$$

Since $\phi_{K^*}(\epsilon) \le \epsilon$, we obviously must have $\frac{2}{m}\alpha(K^*) \le 1$. Suppose that

$$\frac{2}{m}\alpha(K^*) = 1$$

Then it holds that

$$\phi_{K^*}(\epsilon) = \epsilon + \sqrt{2m}\,\rho\sqrt{\epsilon} + \frac{m}{2}\rho^2 + \beta(K^*) > \epsilon$$

This is a contradiction, so it holds that

$$\frac{2}{m}\alpha(K^*) < 1$$

It is thus readily apparent that

$$v - \phi_{K^*}(v) \to \infty$$

as $v \to \infty$. Therefore, there exists a point $b > a$ such that

$$\phi_{K^*}(b) < b$$

It is easy to check that $\phi_{K^*}(v)$ is increasing and strictly concave. Therefore, we can apply Theorem 3.3 from [?] to
conclude that there exists a unique, positive fixed point $\bar{v}$ of $\phi_{K^*}(v)$.

Next, suppose that $\phi'_{K^*}(\bar{v}) > 1$. Then by Taylor's Theorem for $v > \bar{v}$ sufficiently close to $\bar{v}$, we have

$$\phi_{K^*}(v) > v$$

However, we know that as $v \to \infty$, it holds that $v - \phi_{K^*}(v) \to \infty$. By the Intermediate Value Theorem, this implies that
there is another fixed point on $[v, \infty)$. This is a contradiction, since $\bar{v}$ is the unique, positive fixed point. Therefore, it
holds that $\phi'_{K^*}(\bar{v}) \le 1$. Now, suppose that $\phi'_{K^*}(\bar{v}) = 1$. Since $\phi_{K^*}(v)$ is strictly concave, its derivative is decreasing [22].
Therefore, on $[0, \bar{v})$, it holds that

$$\phi'_{K^*}(v) > 1$$

This implies that

$$\phi_{K^*}(\bar{v}) = \phi_{K^*}(0) + \int_0^{\bar{v}} \phi'_{K^*}(x)\,dx > \phi_{K^*}(0) + \bar{v} > \bar{v}$$

This is a contradiction, so it must be that $\phi'_{K^*}(\bar{v}) < 1$. □

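As a simple consequence of the concavity of $\phi_{K^*}(v)$, we can study a fixed point iteration involving $\phi_{K^*}(v)$. Define
the $n$-fold composition mapping

$$\phi^{(n)}_{K^*}(v) = (\underbrace{\phi_{K^*} \circ \cdots \circ \phi_{K^*}}_{n \text{ times}})(v)$$

Lemma 15. For any $v > 0$, it holds that

$$\lim_{n \to \infty} \phi^{(n)}_{K^*}(v) = \bar{v}$$

Proof. Following [?], for any fixed point $\bar{v}$, it holds that

$$|\phi_{K^*}(v) - \bar{v}| \le \phi'_{K^*}(\bar{v})\,|v - \bar{v}|$$

Therefore, applying the fixed point property repeatedly yields

$$|\phi^{(n)}_{K^*}(v) - \bar{v}| \le \left(\phi'_{K^*}(\bar{v})\right)^n |v - \bar{v}|$$

By Lemma 14, it holds that

$$\phi'_{K^*}(\bar{v}) < 1$$

and so the result follows.

□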
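The fixed point iteration of Lemmas 14 and 15 is easy to try numerically. The sketch below iterates $v \mapsto \alpha(\sqrt{2v/m} + \rho)^2 + \beta$ for fixed, hypothetical values of $\alpha$, $\beta$, $\rho$, and $m$ (chosen so that $2\alpha/m < 1$); the function names, tolerance, and starting points are assumptions for illustration only.

```python
import math

def phi(v, alpha, beta, rho, m):
    """phi_K(v) = alpha * (sqrt(2 v / m) + rho)^2 + beta for a fixed K."""
    return alpha * (math.sqrt(2.0 * v / m) + rho) ** 2 + beta

def fixed_point(alpha, beta, rho, m, v0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate v <- phi(v) until successive values differ by at most tol."""
    v = v0
    for _ in range(max_iter):
        v_next = phi(v, alpha, beta, rho, m)
        if abs(v_next - v) <= tol:
            return v_next
        v = v_next
    raise RuntimeError("fixed point iteration did not converge")

# Hypothetical constants with 2 * alpha / m = 0.2 < 1.
params = dict(alpha=0.1, beta=0.01, rho=0.5, m=1.0)
v_from_below = fixed_point(v0=1e-3, **params)
v_from_above = fixed_point(v0=10.0, **params)
print(v_from_below, v_from_above)
```

Both starting points lead to the same limit, consistent with the uniqueness of the positive fixed point and the convergence of the $n$-fold composition.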


Now, we show that we appropriately control the excess risk when we estimate $\rho$. The extension of this argument
to the case when we also estimate function parameters $\psi$ is straightforward. If we have

Kfi 

F({^«W}f=i I x„-uKn) = 

k=l 


then 



Therefore, it holds that 


^[fn{Xn)]-Mx*„) 


b 



Suppose that we set 
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This sigma algebra contains all the information about {p„} and thus {Kn}. Then, we do not have 




tr=l 


since K„+\,Kn+ 2 ,... are a function of We do not even have 

E[/„(a:„) I Xo]-/„(**) < b {fn-\{x„-\) - fn-l{xl_^)) +p^ ,Kn 

However, we would expect that this is not too far from true. Conceptually, we consider running our approach twice on
independent samples. The first run determines the required number of samples $\{K_n\}_{n=1}^{\infty}$. We then run our process for
a second run with these fixed choices of $\{K_n\}_{n=1}^{\infty}$ and independent samples as in Figure 1. For the second run, it is true
that

P{{z^n\k)}lLl I ^oo) = np„(4^'(^)) 


tr=l 


and 


E 


f„{x^n^)\Xoo -fn{xl)<bi (/«-l(4-\)-/«-l«-l)) +P^ 


In practice, we do not need to run our process twice. This is only a proof technique. Now, for the second run the 
recursion 


4^^ = b 


-e4\+p) ,K„\ Vn>3 


(19) 


with ei and £2 from Assumntion lA.T [ bounds the excess risk of the second run 

I Xo] -/„(a;*) < eP 

Then it follows that 

E[/„(a;f)]-/„(a;:)<E[£i4 
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Figure 1: Two Run Process 


We now argue that E[4^^] 


also bounds the excess risk of the first run. 
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Lemma 16. For the first run, it holds that 

WniXn)]- fn{x*n) < 

Proof. We proceed by induction. For n= 1,2, we know that 

nf„{x„)]-fn{xi)<n£^\ 

by definition. Next, suppose that 

E[/„-1 {xn- 1)]-/«-1 {xl_ 1) < E[e^^\] 

We have 

"^[fnixn)]-fn{xl)<¥. a{Kn) 1 (at„-1)-/«-!) +p) +^{Kn) 

so it holds that 

E[ei^^] - (E[/„(a;„)] -/„(a;*)) 

— E CC(Kfi) £nl\ — CC(Kfi) {^\J fn-\(x„^i) — /n -1 (at*_ J ) + 

= E [a{K„) (ef\ - {fn-l{Xn-l) - fn-l{xU))) 

+ E 2pCC(Kfi) £n\ ~ \Jfn-l(Xfi-l) — fn-l{x*^_^^ 

By the Monotone Convergence Theorem, it holds that 

E [a{Kn) (ef\ - {fn-l{Xn-x) - fn-l{xU))) 

= _limE max{a(li:„),l/^} - (/«-i(at„-i)-/„-i(at*_i))) 

> liminf-E - (/«-i(at„_i)-/„_i(a:*_i)) 

>0 

where the last line follows, since by hypothesis 

E[/„_ 1 (at„_ 1 )] - fn -1 {xl_ 1 ) < E[e^^\] 
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Similarly, it holds that 


E 


2pa{K„) I y £^J[ y^/«-! (^^n-l) /n-l 

~ [fn -1 {Xn- 1) ~ fn -1 (®ij i)) 


= E 


2pa{K„) 


= lim E 

q^oo 


\J^n\ \J /n-1 (®n-l) 

- (/«-iK-i)-/«-iK i)) 


2p max{a(/:„),l/^} 


\l ^n\ + fn -1 (®n-1) fn -1 (33*_ i) 


2p 

> lim sup—E 
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2p 

> lim sup — lim E 
q^^ q 


^n-1 (/n-l(3;n-l) 

\l^n\ + y^/n-1 (®n-l) ~ 

~ (/«-l(®n-l) ~/«-l(®n-l)) 


2p 1 

> lim sup — lim sup —E 

q^tx, q T—>1>0 T 

>0 

Therefore, we conclude that 


^n\~ (/«-l(®n-l) ~/«-l(®n-l)) 


E[/„(a;„)]-/„K)<E[ei")] 


□ 


Theorem 2. Under assumptions C.1 and C.2 and with $K_n \ge K^*$ for all $n$ large enough almost surely with $K^*$ from (20),
we have

$$\limsup_{n \to \infty} \left(\mathbb{E}[f_n(x_n)] - f_n(x_n^*)\right) \le \epsilon$$

Proof Let v be the fixed point associated with (jtK* (v) from Lemma[T4l We know that 

V = (r) < e 

and 

(jtj^^iv) V < e 

with V < e. Since we have K,, > K* for all n large enough almost surely, there exists a random variable N such that 

n>N =» K„>K* 

Then we have almost surely 

{'T\ 

lim sup < lim sup(0^:„ o---o^k.) (ey_ ^) 

n^oo «—)-oo 

< limsup4r'^^'^(ey_i) 


= V 

< s 
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Finally, applying Lemma[T9]and Fatou’s lemma yields 


limsup(E[/„(a;„)] -/„(a:*)) 


< limsupE 




( 2 ) 


< 

< 


E 


lim sup 

foo 


e 


□ 


4.2 Update Past Excess Risk Bounds 

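We first consider updating all past excess risk bounds as we go. At time $n$, we plug in $\hat{\rho}_{n-1} + t_{n-1}$ in place of $\rho$ and
follow the analysis of Section 2. Define for $i = 1, \ldots, n$

$$\hat{\epsilon}_i^{(n)} = b\left(\left(\sqrt{\frac{2}{m}\hat{\epsilon}_{i-1}^{(n)}} + (\hat{\rho}_{n-1} + t_{n-1})\right)^2, K_i\right)$$

If it holds that $\hat{\rho}_{n-1} + t_{n-1} \ge \rho$, then $\mathbb{E}[f_i(x_i)] - f_i(x_i^*) \le \hat{\epsilon}_i^{(n)}$ for $i = 1, \ldots, n$. Assumption C.1 guarantees that this
holds for all $n$ large enough almost surely. We can thus set $K_n$ equal to the smallest $K$ such that

$$b\left(\left(\sqrt{\frac{2}{m}\max\{\hat{\epsilon}_{n-1}^{(n)}, \epsilon\}} + (\hat{\rho}_{n-1} + t_{n-1})\right)^2, K\right) \le \epsilon$$

for all $n \ge 3$ to achieve excess risk $\epsilon$. The maximum in this definition ensures that when $\hat{\rho}_{n-1} + t_{n-1} \ge \rho$, $K_n \ge K^*$
with $K^*$ from (5). We can therefore apply Theorem 2.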

4.3 Do Not Update Past Excess Risk Bounds 

Updating all past estimates of the excess risk bounds from time 1 up to $n$ imposes a computational and memory burden.
Suppose that for all $n \ge 3$ we set

$$K_n = \min\left\{K \ge 1 \;:\; b\left(\left(\sqrt{\frac{2\epsilon}{m}} + \hat{\rho}_{n-1} + t_{n-1}\right)^2, K\right) \le \epsilon\right\} \qquad (20)$$

This is the same form as the choice in (5) with $\hat{\rho}_{n-1} + t_{n-1}$ in place of $\rho$. Due to assumption C.1, for all $n$ large enough
it holds that $\hat{\rho}_n + t_n \ge \rho$ almost surely. Then by the monotonicity assumption in A.1, for all $n$ large enough we pick
$K_n \ge K^*$ almost surely. We can therefore apply Theorem 2.
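The selection rule of this subsection can be sketched as a search for the smallest $K$ whose bound meets the target. The code below is a sketch under assumptions: it takes the bound as a caller-supplied function `bound(d0, K)` that is non-increasing in $K$, reads the initial distance as $(\sqrt{2\epsilon/m} + \hat{\rho} + t)^2$, and uses a made-up bound `toy_bound` purely to exercise the search. None of these names come from the paper.

```python
import math

def select_num_samples(bound, eps, m, rho_hat, t, k_max=2 ** 30):
    """Smallest K >= 1 with bound(d0, K) <= eps, assuming bound is non-increasing in K."""
    d0 = (math.sqrt(2.0 * eps / m) + rho_hat + t) ** 2
    if bound(d0, 1) <= eps:
        return 1
    hi = 2
    while bound(d0, hi) > eps:          # doubling phase
        hi *= 2
        if hi > k_max:
            raise ValueError("target excess risk not reachable within k_max samples")
    lo = hi // 2                        # bound(d0, lo) > eps >= bound(d0, hi)
    while hi - lo > 1:                  # bisection phase
        mid = (lo + hi) // 2
        if bound(d0, mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi

def toy_bound(d0, K):
    """Hypothetical bound of the form alpha(K) * d0 + beta(K) with alpha = beta = 1/K."""
    return (d0 + 1.0) / K

K_small = select_num_samples(toy_bound, eps=0.1, m=1.0, rho_hat=0.9, t=0.1)
K_large = select_num_samples(toy_bound, eps=0.1, m=1.0, rho_hat=1.9, t=0.1)
print(K_small, K_large)
```

A larger estimate of the change in the minimizers inflates the initial distance and therefore the number of samples, which is the monotonicity the argument above relies on.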


5 Experiments 

We focus on two regression applications for synthetic and real data as well as two classification applications for
synthetic and real data. For the synthetic regression problem, we can explicitly compute $\rho$ and $x_n^*$ and exactly evaluate
the performance of our method. It is straightforward to check that all requirements in A.1 to A.4 are satisfied for the
problems considered in this section. We apply the "do not update past excess risk bounds" choice of $K_n$ here.

5.1 Synthetic Regression

Consider a regression problem with synthetic data using the penalized quadratic loss

$$\ell(x,z) = \frac{1}{2}\left(y - w^\top x\right)^2 + \frac{1}{2}\lambda\|x\|^2$$

with $z = (w,y) \in \mathbb{R}^d \times \mathbb{R}$. The distribution of $z_n$ is zero mean Gaussian with covariance matrix

$$\begin{bmatrix} \sigma_w^2 I & r_{w_n,y_n} \\ r_{w_n,y_n}^\top & \sigma_{y_n}^2 \end{bmatrix}$$

Under these assumptions, we can analytically compute minimizers $x_n^*$ of $f_n(x) = \mathbb{E}_{z_n \sim p_n}[\ell(x, z_n)]$. We change only
$r_{w_n,y_n}$ and $\sigma_{y_n}^2$ appropriately to ensure that $\|x_n^* - x_{n-1}^*\| = \rho$ holds for all $n$. We find approximate minimizers using
SGD with $\lambda = 0.1$. We estimate $\rho$ using the direct estimate.
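For the penalized quadratic loss with zero mean data, setting the gradient of the expected loss to zero gives $x^* = (\Sigma_w + \lambda I)^{-1} r_{w,y}$, where $\Sigma_w$ is the feature covariance and $r_{w,y}$ the cross-covariance. The sketch below computes this minimizer and shows one possible way (an assumption, not necessarily the construction used in the experiments) to move the cross-covariance so that consecutive minimizers differ by exactly $\rho$ in norm. All numeric values are illustrative.

```python
import numpy as np

def population_minimizer(Sigma_w, r_wy, lam):
    """Minimizer of E[0.5 (y - w^T x)^2] + 0.5 lam ||x||^2 for zero mean (w, y)."""
    d = Sigma_w.shape[0]
    return np.linalg.solve(Sigma_w + lam * np.eye(d), r_wy)

sigma_w2 = 1.0                      # feature variance (illustrative)
lam = 0.1
rho = 1.0
Sigma_w = sigma_w2 * np.eye(3)
r_wy = np.array([0.5, -0.2, 0.1])   # cross-covariance between w and y (illustrative)

x_star = population_minimizer(Sigma_w, r_wy, lam)

# Shift the cross-covariance along a unit direction u so that the minimizer moves by rho.
u = np.array([1.0, 0.0, 0.0])
r_wy_next = r_wy + rho * (sigma_w2 + lam) * u
x_star_next = population_minimizer(Sigma_w, r_wy_next, lam)

print(x_star, np.linalg.norm(x_star_next - x_star))
```

A full simulation would also need to choose the response variance large enough that the joint covariance matrix stays positive semidefinite, which this sketch does not check.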

We let $n$ range from 1 to 20 with $\rho = 1$, a target excess risk $\epsilon = 0.1$, and $K_n$ from (20). We average over twenty runs
of our algorithm. Figure 2 shows $\hat{\rho}_n$, our estimate of $\rho$, which is above $\rho$ in general. Figure 3 shows the number of
samples $K_n$, which settles down. We can exactly compute $f_n(x_n) - f_n(x_n^*)$, and so by averaging over the twenty runs
of our algorithm, we can estimate the excess risk (denoted "sample average estimate"). Figure 4 shows this estimate
of the excess risk, the target excess risk, and our bound on the excess risk from Section 4.3. We achieve at least our
targeted excess risk.

Figure 2: $\rho$ Estimate

Figure 3: $K_n$

Figure 4: Excess Risk


5.2 Panel Study on Income Dynamics Income - Regression 

The Panel Study of Income Dynamics (PSID) surveyed individuals every year to gather demographic and income
data annually from 1981-1997 [24]. We want to predict an individual's annual income ($y$) from several demographic
features ($w$) including age, education, work experience, etc., chosen based on previous economic studies in [25]. The
idea of this problem conceptually is to rerun the survey process and determine how many samples we would need if
we wanted to solve this regression problem to within a desired excess risk criterion $\epsilon$.

We use the same loss function, direct estimate for $\rho$, and minimization algorithm as the synthetic regression
problem. The income is adjusted for inflation to 1997 dollars with mean \$20,294. We average over twenty runs of our
algorithm by resampling without replacement [?]. We compare to taking an equivalent number of samples up front.
Figure 5 shows the test losses over time evaluated over twenty percent of the available samples. The test loss for our
approach is substantially less than taking the same number of samples up front. The square root of the average test
loss over this time period for our approach and all samples up front are \$1153 ± 352 and \$2805 ± 424 respectively in
1997 dollars.

Figure 5: Test Loss


5.3 Synthetic Classification 

Consider a binary classification problem using $\ell(x,z) = \frac{1}{2}\left(1 - y(w^\top x)\right)_+^2 + \frac{1}{2}\lambda\|x\|^2$ with $z = (w,y) \in \mathbb{R}^d \times \mathbb{R}$ and

$(y)_+ = \max\{y, 0\}$. This is a smoothed version of the hinge loss used in support vector machines (SVM) [?]. We
suppose that at time n, the two classes have features drawn from a Gaussian distribution with covariance matrix d^I 
but different means and i.e., w„ \ {y„ = /} , cr^/). The class means move slowly over uniformly 

spaced points on a unit sphere in R'^ as in Figure | 6 ] to ensure that (|2|i holds. We find approximate minimizers using 
SGD with X =0.1. We estimate p using the direct estimate with f„ 0 = 1 
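The smoothed hinge loss and the per-sample gradient that SGD would use can be written down directly. The sketch below assumes the loss has the form $\frac{1}{2}(1 - y\,w^\top x)_+^2 + \frac{1}{2}\lambda\|x\|^2$ with labels $y \in \{-1, +1\}$; the function names and the test point are hypothetical, and the finite-difference comparison is only a sanity check of the gradient formula.

```python
import numpy as np

def smoothed_hinge_loss(x, w, y, lam):
    """0.5 * max(1 - y w^T x, 0)^2 + 0.5 * lam * ||x||^2."""
    margin = max(1.0 - y * (w @ x), 0.0)
    return 0.5 * margin ** 2 + 0.5 * lam * (x @ x)

def smoothed_hinge_grad(x, w, y, lam):
    """Gradient of the smoothed hinge loss with respect to x."""
    margin = max(1.0 - y * (w @ x), 0.0)
    return -margin * y * w + lam * x

# Sanity check against central finite differences at an arbitrary point.
x = np.array([0.2, -0.1])
w = np.array([1.0, 2.0])
y, lam, h = 1.0, 0.1, 1e-6

analytic = smoothed_hinge_grad(x, w, y, lam)
numeric = np.array([
    (smoothed_hinge_loss(x + h * e, w, y, lam) - smoothed_hinge_loss(x - h * e, w, y, lam)) / (2 * h)
    for e in np.eye(2)
])
print(analytic, numeric)
```

Unlike the plain hinge loss, this loss is differentiable at the margin boundary, which is what allows it to have a Lipschitz continuous gradient.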



Figure 6 : Evolution of Class Means 

We let $n$ range from 1 to 20 and target an excess risk $\epsilon = 0.1$. We average over twenty runs of our algorithm. As
a comparison, if our algorithm takes $\{K_n\}_{n=1}^{20}$ samples, then we consider taking $\sum_{n=1}^{20} K_n$ samples up front at $n = 1$.

This is what we would do if we assumed that our problem is not time varying. Figure 7 shows $\hat{\rho}_n$, our estimate of $\rho$.
Figure 8 shows the average test loss for both sampling strategies. To compute test loss we draw $T_n$ additional samples
$\{z_n^{\text{test}}(k)\}_{k=1}^{T_n}$ from $p_n$ and compute $\frac{1}{T_n}\sum_{k=1}^{T_n} \ell(x_n, z_n^{\text{test}}(k))$. We see that our approach achieves substantially smaller
test loss than taking all samples up front.

Figure 7: $\rho$ Estimate

Figure 8: Test Loss

5.4 General Social Survey - Classification 

The General Social Survey (GSS) surveyed individuals every year to gather socio-economic data annually from 1981-
2013 [?]. We want to predict an individual's marital status ($y$) from several demographic features ($w$) including age,
education, etc. We model this as a binary classification problem using loss

$$\ell(x,z) = \frac{1}{2}\left(1 - y(w^\top x)\right)_+^2 + \frac{1}{2}\lambda\|x\|^2$$

with $z = (w,y) \in \mathbb{R}^d \times \mathbb{R}$ and $(y)_+ = \max\{y, 0\}$. This is a smoothed version of the hinge loss used in support vector
machines [?]. We find approximate minimizers using SGD with $\lambda = 0.1$. Figure 9 shows the test loss. We see that
our approach achieves smaller test loss than taking all samples up front. We also plot receiver operating characteristics
(ROC) [?] to characterize the performance of our classifiers. In particular, we plot the ROC for 1974 in Figure 10 and
the ROC for 2012 in Figure 11. By examining the ROC, we see that taking all samples up front is much better in 1974
but much worse in 2012.


6 Conclusion 

We introduced a framework for adaptively solving a sequence of optimization problems with applications to machine 
learning. We developed estimates of the change in the minimizers used to determine the number of samples Kn needed 
to achieve a target excess risk $\epsilon$. Experiments with synthetic and real data demonstrate that this approach is effective.

A Examples of $b(d_0, K)$

For this section, we drop the $n$ index for convenience. The bounds of this form depend on the strong convexity
parameter $m$ and an assumption on how the gradients grow. In general, we assume that

$$\mathbb{E}_{z \sim p}\|\nabla_x \ell(x,z)\|^2 \le A + B\|x - x^*\|^2$$

The base algorithm we look at is SGD. First, we generate iterates $x(0), \ldots, x(K)$ through SGD as follows:

$$x(l+1) = \Pi_{\mathcal{X}}\left[x(l) - \mu(l+1)\nabla_x \ell(x(l), z(l))\right] \quad l = 0, \ldots, K-1$$

with $x(0)$ fixed. We then combine the iterates to yield a final approximate minimizer

$$\bar{x}(K) = \phi(x(0), \ldots, x(K))$$

For our choice of $\phi$ we look at two cases:

1. No iterate averaging, i.e.,

$$\phi(x(0), \ldots, x(K)) = x(K)$$

2. Iterate averaging, i.e., for a convex combination $\lambda$

$$\phi(x(0), \ldots, x(K)) = \sum_{l=0}^{K} \lambda(l) x(l)$$
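The projected SGD recursion and the two ways of combining iterates can be sketched as follows. This is an illustration under assumptions: the feasible set is taken to be a Euclidean ball, the averaging weights are uniform over $x(1), \ldots, x(K)$ (one particular convex combination), the step size is $1/(m l)$, and the data are synthetic penalized-quadratic samples. The names `project_ball`, `projected_sgd`, and `quad_grad` are hypothetical.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto {x : ||x|| <= radius}, a stand-in for the feasible set."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_sgd(grad, samples, x0, step, radius, average=False):
    """Run x(l+1) = Proj[x(l) - step(l+1) * grad(x(l), z(l))] over the given samples."""
    x = np.asarray(x0, dtype=float)
    iterates = []
    for l, z in enumerate(samples):
        x = project_ball(x - step(l + 1) * grad(x, z), radius)
        iterates.append(x)
    if average:
        return np.mean(iterates, axis=0)   # uniform weights on x(1..K), weight 0 on x(0)
    return x                               # no iterate averaging

lam = 0.1

def quad_grad(x, z):
    """Per-sample gradient of 0.5 (y - w^T x)^2 + 0.5 lam ||x||^2."""
    w, y = z
    return (w @ x - y) * w + lam * x

rng = np.random.default_rng(0)
x_true = np.array([1.0, -1.0])
W = rng.standard_normal((20000, 2))
Y = W @ x_true + 0.1 * rng.standard_normal(20000)
samples = list(zip(W, Y))

m = 1.0 + lam                              # strong convexity of the expected loss here
step = lambda l: 1.0 / (m * l)
x_last = projected_sgd(quad_grad, samples, np.zeros(2), step, radius=2.0)
x_avg = projected_sgd(quad_grad, samples, np.zeros(2), step, radius=2.0, average=True)
x_star = x_true / (1.0 + lam)              # population minimizer for identity feature covariance
print(np.linalg.norm(x_last - x_star), np.linalg.norm(x_avg - x_star))
```

The projection keeps the early, large-step iterates bounded; once the step size has decayed, the last iterate settles close to the population minimizer.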

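Define

$$d(l) \triangleq \|x(l) - x^*\|^2$$

First we bound $\mathbb{E}[d(l)]$ in Lemma 17.

Lemma 17. Suppose that the function $f(x)$ has Lipschitz continuous gradients. Then it holds that

$$\mathbb{E}[d(l)] \le \prod_{k=1}^{l}\left(1 - 2m\mu(k) + B\mu^2(k)\right)\mathbb{E}[d(0)] + A\sum_{k=1}^{l}\prod_{i=k+1}^{l}\left(1 - 2m\mu(i) + B\mu^2(i)\right)\mu^2(k)$$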

Proof. Following the standard SGD analysis (see M), it holds that 

d{£) < \\xi£-l)-x*-p{£)V^£{xi£-l),zi£))f 

< di£-l)-2B{£){x{£-l)-xfV^£ix{£-l),z{£)))+p\£)\\V^£{x{£-l),z{£))f 

Then it follows that 

E[d{£)\x{£-1)] 

<d(£-l)-2p{£){xi£-l)-x*,Vf{x{£-l)))+p\£)E[\\V^£{x{£-l),zi£))f\x{£-l)] 
< (1 - 2mp{£)+Bp^{£))di£ - 1) + B^i£- 1)A 


and 

Since B > m, we have 

and so 


E[di£)] < il-2mp{£)+Bp^{£))E[di£-l)]+p^{£-l)A 


B 


B 


1 1 


2mix-Bp^<2\l-p\l-^-p\<2-^ 2 


1 - 2mp{£)+Bp\£) > 1 - ^ ^ 


Since this quantity is non-negative, we can unwind this recursion to yield 

^ i i 

E[di£)]<Yl{l-2mii{£)+Bii^{£)) + ^ fj [I-2mp{i) + Bii^{i))p^{k) 
k=l k=li=k+l 
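The unwound bound can be evaluated numerically either by running the one-step recursion or by summing the products directly. The sketch below assumes the one-step recursion reads $\mathbb{E}[d(l)] \le (1 - 2m\mu(l) + B\mu^2(l))\,\mathbb{E}[d(l-1)] + A\mu^2(l)$, with the initial distance multiplying the leading product and $A$ multiplying the sum; the function names and the constants in the example are hypothetical.

```python
import math

def bound_recursive(d0, K, step, m, A, B):
    """Run the one-step recursion K times starting from d0."""
    bound = d0
    for l in range(1, K + 1):
        mu = step(l)
        bound = (1.0 - 2.0 * m * mu + B * mu ** 2) * bound + A * mu ** 2
    return bound

def bound_explicit(d0, K, step, m, A, B):
    """Evaluate the unwound product-sum form of the same bound."""
    # factors[l - 1] is the contraction factor of step l.
    factors = [1.0 - 2.0 * m * step(l) + B * step(l) ** 2 for l in range(1, K + 1)]
    total = d0 * math.prod(factors)
    for k in range(1, K + 1):
        # factors[k:] covers steps k+1, ..., K; an empty product equals 1.
        total += A * step(k) ** 2 * math.prod(factors[k:])
    return total

step = lambda l: 0.5 / l            # illustrative step sizes with non-negative factors
b_rec = bound_recursive(4.0, 50, step, m=1.0, A=1.0, B=2.0)
b_exp = bound_explicit(4.0, 50, step, m=1.0, A=1.0, B=2.0)
print(b_rec, b_exp)
```

The recursive form costs $O(K)$ operations while the explicit form costs $O(K^2)$, so the recursion is the more practical way to tabulate the bound over many values of $K$.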


The bound in LemmafTTlcan be further bounded into a closed form as follows from ll28l : Define 

ifpfO 


llog(f), ifj3=0 


Then with fi{£) = C£ “, it holds that 


f2exp{2BC>i_2aW}exp{-^^£' “}(E[t/(0)] + f) + ^, ifO<a<l 

‘ *”-i=^(EW0)l + |)+AC22=|ia^, if« = l 


( 21 ) 


□ 
Note that this bound is a closed form but is substantially looser than Lemma 17. In the case that the functions in
question have Lipschitz continuous gradients, we introduce a bound on the excess risk using Lemma 17. This case
corresponds to choosing

$$\phi(x(0), \ldots, x(K)) = x(K)$$

Lemma 18. With arbitrary step sizes and assuming that $f(x)$ has Lipschitz continuous gradients with modulus $M$, it
holds that

$$\mathbb{E}[f(\bar{x})] - f(x^*) \le \frac{1}{2} M\,\mathbb{E}[d(K)]$$

and therefore, we set

$$b(d_0, K) = \frac{1}{2} M\left(\prod_{l=1}^{K}\left(1 - 2m\mu(l) + B\mu^2(l)\right) d_0 + A\sum_{l=1}^{K}\prod_{i=l+1}^{K}\left(1 - 2m\mu(i) + B\mu^2(i)\right)\mu^2(l)\right)$$

Proof. Using the descent lemma from [?] and plugging in the bound from Lemma 17 yields the bound $b(d_0, K)$. □

Next, we introduce a bound inspired by IMl for the case where ^(a:(0),... ,x{K)) corresponds to forming a convex 
combination of the iterates. 


Lemma 19. With a constant step size and averaging with 




I > 0 

lo, ifi = o 


where 

it holds that 


Y{i) = (1 -mp+Bp^) ^ 

do 


b{do,K) = 


2Mlio7W ' 2 


-Ap 


Proof By strong convexity, it holds that 

-{x{i- l)-x*,Vf{x{i- 1))) < -m\\x{£ - 1) - x*f - {f{x{£ - 1)) - f{x*)) 

Following the Lyapunov-style analysis of Lemma[T3 it holds that 

E[d{£)] < {I -mp + Bp^)E[d{£- l)]-2p{E[f{x{£- 1))]- f{x*))+Ap^ 
Rearranging, using the telescoping sum, and using convexity, it holds that 

do 


E[nx)]-fix*)< 


2mLLo7(t) ' 2 


-Ap 


If we set p = then it holds that 


b{do,K) = 6 


1 


□ 


for Lemma[T9| 

We consider an extension of the averaging scheme in [?]. The bound in that paper only works with $B = 0$, so we
extend it slightly to handle $B > 0$.
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Lemma 20. Consider the choice of step sizes given by 


Then 


where 


b{do,K) = 


m = W>1 


1 + ^m{K + 1) {K + 2) 
E[t/(£)] < Jif) 


Note that we can use the bound in Lemma WA here. 
Proof. We have using Lyapunov style analysis 


E[die)] < {l-2mfi{i)+Bp^{e))E[d{£-l)]-2p{£){E[f{x{i))]-f{x*))+Ap^{e) 


Then we have 


It holds that 


p^(£) 




1 — 2mp{£) 

p^{£) 


+ B ) E[di£-l)] - _(E[/(a;(£))] -/(a;*) +A 


l-2m^(£) 1 


1 1 
■ — 2m- 


As long as we have 

then we get 


p^{£) ii^{£-l) p^{£) p{£) p^(£-l) 

£^ 2m£ (£-lf 

C^~~c C2 
2(mC-l)L-l 
“ C2 


mC —1<1 C< — 

m 


j^ndi£)] - -^^^^E[di£-l)] < BE[d{£- 1)] - -^^(E[f {x {£))]-+A 


Summing and rearranging yields


e=o P (^) 

with /r (0) = 1 by convention. With the weights 


t iE[f{xm-n^l) < + 1 )A + f E[di£)] 


e=o 


ri£) = 




yt 1 

^>=0 mItt 


we have 


Then it holds that 


E[/(ai(/r))]-/(a;*)< 


^diO) + ^{K+l)A + iB^UE[d{£)] 


1 

-r=0 Jftj 


K K Y 

^ = I + ^mx — I -m{K -£ 1){K -\-2) 
T=0 T=1 ^ 
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so 


nf{x{K))]-nx*)< 


i^/(0) + i(/:+l)A + iBlioW)] 


l+^j”t{K+l){K + 2) 


□ 


For the choice of step sizes in Lemma|20]from Lemma [T71 it holds that 

'L 


Since 


it holds that 


E[t/(£)] = ^ 


f i = ^(iog/:) 

e=i ^ 


nmm - n^i = ^ (^+^ 

Note that a rate of $O(1/K)$ is minimax optimal for stochastic minimization of a strongly convex function [32].
Next, we look at a special case of averaging for functions such that

$$\mathbb{E}\left\|\nabla_x \ell(x,z) - \nabla_x \ell(\tilde{x},z) - \nabla^2_{xx} \ell(\tilde{x},z)(x - \tilde{x})\right\|^2 = 0$$

from [28]. For example, quadratics satisfy this condition.

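Lemma 21. Assuming that

$$\mathbb{E}\left\| \nabla_x \ell(x,z) - \nabla_x \ell(\tilde{x},z) - \nabla^2_{xx} \ell(\tilde{x},z)\,(x - \tilde{x}) \right\|^2 = 0,$$

we select step sizes

$$\mu(\ell) = C\ell^{-\alpha}$$

with $\alpha > 1/2$, and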


10, if£ = 0 


it holds that 


{E[d~iK)]) 
1 


m'/2 


1/2 

■-1 

E 

k=\ 


fx[k 1) fx (/:) 








1/2 



mK V mK^ 




k=l 


with $\bar{d}(K) = \|\bar{x}(K) - x^*\|^2$. If in addition $f$ has Lipschitz continuous gradients with modulus $M$, then it holds that


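Proof. Suppose that we set

$$\bar{x}(K) = \frac{1}{K} \sum_{k=1}^{K} x(k-1).$$

Then it holds that

$$\nabla^2_{xx} f(x^*)\,(x(k-1) - x^*) = \nabla_x \ell(x(k-1), z(k-1)) - \nabla_x \ell(x^*, z(k-1)) + \left[ \nabla^2_{xx} f(x^*) - \nabla^2_{xx} \ell(x^*, z(k-1)) \right] (x(k-1) - x^*),$$

yielding

$$\nabla^2_{xx} f(x^*)\,(\bar{x}(K) - x^*) = \frac{1}{K} \sum_{k=1}^{K} \nabla_x \ell(x(k-1), z(k-1)) - \frac{1}{K} \sum_{k=1}^{K} \nabla_x \ell(x^*, z(k-1)) + \frac{1}{K} \sum_{k=1}^{K} \left[ \nabla^2_{xx} f(x^*) - \nabla^2_{xx} \ell(x^*, z(k-1)) \right] (x(k-1) - x^*).$$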


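First, we have

$$\begin{aligned}
\frac{1}{K} \sum_{k=1}^{K} \nabla_x \ell(x(k-1), z(k-1))
&= \frac{1}{K} \sum_{\ell=1}^{K} \frac{1}{\mu(\ell)} \left( x(\ell-1) - x(\ell) \right) \\
&= \frac{1}{K} \sum_{\ell=1}^{K} \frac{1}{\mu(\ell)} \left( x(\ell-1) - x^* \right) - \frac{1}{K} \sum_{\ell=1}^{K} \frac{1}{\mu(\ell)} \left( x(\ell) - x^* \right) \\
&= \frac{1}{K} \sum_{\ell=1}^{K-1} \left( \frac{1}{\mu(\ell+1)} - \frac{1}{\mu(\ell)} \right) \left( x(\ell) - x^* \right) + \frac{1}{K\mu(1)} \left( x(0) - x^* \right) - \frac{1}{K\mu(K)} \left( x(K) - x^* \right).
\end{aligned}$$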
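
The rearrangement in this step is a summation-by-parts identity, and it can be sanity-checked numerically on any scalar sequence. The sketch below assumes the unprojected update $x(\ell) = x(\ell-1) - \mu(\ell)\,g(\ell)$, so that $g(\ell) = (x(\ell-1) - x(\ell))/\mu(\ell)$. The sequence `x`, the reference point `x_star`, and the constants `C` and `alpha` are arbitrary illustrative choices, and the common factor $1/K$ is omitted on both sides.

```python
import math


def lhs(x, mu, K):
    """sum_{l=1}^{K} (x(l-1) - x(l)) / mu(l)."""
    return sum((x[l - 1] - x[l]) / mu(l) for l in range(1, K + 1))


def rhs(x, mu, K, x_star):
    """Summation-by-parts form of the same sum, centered at x_star."""
    interior = sum(
        (1.0 / mu(l + 1) - 1.0 / mu(l)) * (x[l] - x_star) for l in range(1, K)
    )
    first = (x[0] - x_star) / mu(1)
    last = (x[K] - x_star) / mu(K)
    return interior + first - last


# Hypothetical step sizes mu(l) = C * l**(-alpha) and an arbitrary sequence.
C, alpha, K, x_star = 0.5, 0.75, 25, 0.3
mu = lambda l: C * l ** (-alpha)
x = [math.cos(l) for l in range(K + 1)]

print(lhs(x, mu, K), rhs(x, mu, K, x_star))
```

The identity holds for every choice of `x_star`, because the terms involving it cancel. Centering at $x^*$ is what lets each summand be bounded by $(\mathbb{E}[d(\ell)])^{1/2}$ later in the proof.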


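Second, we have

$$\mathbb{E}\left\| \frac{1}{K} \sum_{k=1}^{K} \nabla_x \ell(x^*, z(k-1)) \right\|^2 = \frac{1}{K^2} \sum_{k=1}^{K} \mathbb{E}\left\| \nabla_x \ell(x^*, z(k-1)) \right\|^2 \le \frac{A}{K}.$$

Third, we have

$$\mathbb{E}\left\| \frac{1}{K} \sum_{k=1}^{K} \left[ \nabla^2_{xx} f(x^*) - \nabla^2_{xx} \ell(x^*, z(k-1)) \right] (x(k-1) - x^*) \right\|^2 \le \frac{2B}{K^2} \sum_{k=1}^{K} \mathbb{E}[d(k-1)].$$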


Combining these bounds with Minkowski’s inequality yields 
(mE[cf(/:)])‘/^ 

<{n^lj{x*){x{K)-x*)\\^y'^ 


K-l 

<E 

yt=l 


1 


1 


/r(k+l) ii{k) 


(E[«f(k)])'/2 + ^ (E[«f(0)])^/2 ^ ^ {nd{K)]YI^ 



2B 


V E V E2 ^ E[t/(k- 1)] 


t:=l 
Then we have


(ETO]) 


1/2 


A"-! 


m'/2 


k=\ 


1) II (k) 


( E [ J (^)])'/2 + - 




{E[d{Q)]) 


1/2, 




(E[J(^:)]) 


1/2 



niK \ niK^ 


tndik-l)] 


k=l 


□ 


This decays at rate ^ as long as =Cl ^ with i <«< 1. 


B Useful Concentration Inequalities 

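For our analysis of both the direct and IPM estimates, we need the following key technical lemma from [33]. This
lemma controls the concentration of sums of random variables that are sub-Gaussian conditioned on a particular
filtration $\{\mathcal{F}_i\}_{i=0}^{n}$. Such a collection of random variables is referred to as a sub-Gaussian martingale sequence. We
include the proof for completeness.

Lemma 22 (Theorem 7.5 of [33]). Suppose we have a collection of random variables $\{V_i\}_{i=1}^{n}$ and a filtration $\{\mathcal{F}_i\}_{i=0}^{n}$
such that for each random variable $V_i$ it holds that

1. $\mathbb{E}\left[ e^{sV_i} \mid \mathcal{F}_{i-1} \right] \le e^{\frac{1}{2}\sigma_i^2 s^2}$ with $\sigma_i^2$ a constant

2. $V_i$ is $\mathcal{F}_i$-measurable

Then for every $a \in \mathbb{R}^n$ it holds that

$$\mathbb{P}\left\{ \sum_{i=1}^{n} a_i V_i > t \right\} \le \exp\left\{ -\frac{t^2}{2\nu} \right\} \quad \forall t > 0$$

and

$$\mathbb{P}\left\{ \sum_{i=1}^{n} a_i V_i < -t \right\} \le \exp\left\{ -\frac{t^2}{2\nu} \right\} \quad \forall t > 0$$

with

$$\nu = \sum_{i=1}^{n} \sigma_i^2 a_i^2.$$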

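Proof. We bound the moment generating function of $\sum_{i=1}^{n} a_i V_i$ by induction. As a base case, we have

$$\mathbb{E}\left[ e^{s a_1 V_1} \right] = \mathbb{E}\left[ \mathbb{E}\left[ e^{s a_1 V_1} \mid \mathcal{F}_0 \right] \right] \le e^{\frac{1}{2} s^2 a_1^2 \sigma_1^2}.$$

Assume for induction that we have

$$\mathbb{E}\left[ \exp\left\{ s \sum_{i=1}^{j} a_i V_i \right\} \right] \le \exp\left\{ \frac{1}{2} \left( \sum_{i=1}^{j} a_i^2 \sigma_i^2 \right) s^2 \right\}.$$

Then we have

$$\begin{aligned}
\mathbb{E}\left[ \exp\left\{ s \sum_{i=1}^{j+1} a_i V_i \right\} \right]
&= \mathbb{E}\left[ \mathbb{E}\left[ \exp\left\{ s \sum_{i=1}^{j+1} a_i V_i \right\} \,\middle|\, \mathcal{F}_j \right] \right] \\
&\overset{(a)}{=} \mathbb{E}\left[ \exp\left\{ s \sum_{i=1}^{j} a_i V_i \right\} \mathbb{E}\left[ e^{s a_{j+1} V_{j+1}} \mid \mathcal{F}_j \right] \right] \\
&\overset{(b)}{\le} \mathbb{E}\left[ \exp\left\{ s \sum_{i=1}^{j} a_i V_i \right\} \right] e^{\frac{1}{2} s^2 a_{j+1}^2 \sigma_{j+1}^2} \\
&\overset{(c)}{\le} \exp\left\{ \frac{1}{2} \left( \sum_{i=1}^{j+1} a_i^2 \sigma_i^2 \right) s^2 \right\}
\end{aligned}$$

where (a) follows since $\sum_{i=1}^{j} a_i V_i$ is $\mathcal{F}_j$-measurable, (b) follows since

$$\mathbb{E}\left[ e^{s a_{j+1} V_{j+1}} \mid \mathcal{F}_j \right] \le e^{\frac{1}{2} \sigma_{j+1}^2 a_{j+1}^2 s^2},$$

and (c) is the inductive assumption. This proves that

$$\mathbb{E}\left[ \exp\left\{ s \sum_{i=1}^{n} a_i V_i \right\} \right] \le \exp\left\{ \frac{1}{2} \left( \sum_{i=1}^{n} a_i^2 \sigma_i^2 \right) s^2 \right\} = \exp\left\{ \frac{1}{2} \nu s^2 \right\}.$$

Using the Chernoff bound [?], we have

$$\mathbb{P}\left\{ \sum_{i=1}^{n} a_i V_i > t \right\} \le e^{-st}\, \mathbb{E}\left[ \exp\left\{ s \sum_{i=1}^{n} a_i V_i \right\} \right] \le \exp\left\{ -st + \frac{1}{2} \nu s^2 \right\}.$$

Optimizing the bound over $s$ yields

$$\mathbb{P}\left\{ \sum_{i=1}^{n} a_i V_i > t \right\} \le \exp\left\{ -\frac{t^2}{2\nu} \right\}.$$

The proof for the other tail is similar.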
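
The optimization over $s$ in the last step can be checked numerically. The sketch below assumes the Chernoff exponent has the form $-st + \frac{1}{2}\nu s^2$, whose minimizer is $s = t/\nu$ with minimum value $-t^2/(2\nu)$. The values of `t` and `nu`, the grid, and the function names are illustrative assumptions.

```python
import math


def chernoff_exponent(s, t, nu):
    """Exponent of the Chernoff bound exp(-s*t + 0.5*nu*s**2)."""
    return -s * t + 0.5 * nu * s ** 2


def optimized_bound(t, nu):
    """Closed-form value of the bound after minimizing over s."""
    return math.exp(-t ** 2 / (2.0 * nu))


# Hypothetical values; minimize the exponent over a fine grid of s > 0.
t, nu = 1.5, 0.7
grid = [i * 1e-3 for i in range(1, 10001)]
best_s = min(grid, key=lambda s: chernoff_exponent(s, t, nu))
grid_bound = math.exp(chernoff_exponent(best_s, t, nu))

print(best_s, t / nu)
print(grid_bound, optimized_bound(t, nu))
```

The grid minimizer agrees with $t/\nu$ up to the grid spacing, and the grid minimum matches the closed-form tail bound from above.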

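If the random variables instead satisfy

1. $\mathbb{E}\left[ \exp\left\{ s \left( V_i - \mathbb{E}[V_i \mid \mathcal{F}_{i-1}] \right) \right\} \mid \mathcal{F}_{i-1} \right] \le e^{\frac{1}{2}\sigma_i^2 s^2}$ with $\sigma_i^2$ a constant

2. $V_i$ is $\mathcal{F}_i$-measurable

then Lemma 22 can be applied to $\{V_i - \mathbb{E}[V_i \mid \mathcal{F}_{i-1}]\}$ to yield

$$\mathbb{P}\left\{ \sum_{i=1}^{n} a_i V_i > \sum_{i=1}^{n} a_i \mathbb{E}[V_i \mid \mathcal{F}_{i-1}] + t \right\} \le \exp\left\{ -\frac{t^2}{2\nu} \right\}.$$

If we can upper bound the conditional expectations

$$\mathbb{E}[V_i \mid \mathcal{F}_{i-1}] \le C_i$$

by $\mathcal{F}_{i-1}$-measurable random variables $C_i$, then we have

$$\mathbb{P}\left\{ \sum_{i=1}^{n} a_i V_i > \sum_{i=1}^{n} a_i C_i + t \right\} \le \mathbb{P}\left\{ \sum_{i=1}^{n} a_i V_i > \sum_{i=1}^{n} a_i \mathbb{E}[V_i \mid \mathcal{F}_{i-1}] + t \right\} \le \exp\left\{ -\frac{t^2}{2\nu} \right\}.$$

□

For our analysis, we generally cannot compute $\mathbb{E}[V_i \mid \mathcal{F}_{i-1}]$, but we can find "nice" $C_i$.

To find $\sigma_i^2$ for use in Lemma 22, we frequently use the following conditional version of Hoeffding's Lemma.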
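Lemma 23 (Conditional Hoeffding's Lemma). If a random variable $V$ and a sigma algebra $\mathcal{F}$ satisfy $a \le V \le b$ and
$\mathbb{E}[V \mid \mathcal{F}] = 0$, then

$$\mathbb{E}\left[ e^{sV} \mid \mathcal{F} \right] \le \exp\left\{ \frac{1}{8} (b - a)^2 s^2 \right\}.$$

Proof. We follow the standard proof of Hoeffding's Lemma from [?]. Since $e^{sx}$ is convex, it follows that

$$e^{sx} \le \frac{b - x}{b - a}\, e^{sa} + \frac{x - a}{b - a}\, e^{sb}, \qquad a \le x \le b.$$

Therefore, taking the conditional expectation with respect to $\mathcal{F}$ yields

$$\mathbb{E}\left[ e^{sV} \mid \mathcal{F} \right] \le \frac{b - \mathbb{E}[V \mid \mathcal{F}]}{b - a}\, e^{sa} + \frac{\mathbb{E}[V \mid \mathcal{F}] - a}{b - a}\, e^{sb}. \qquad (22)$$

Let $h = s(b - a)$, $p = -\frac{a}{b - a}$ and $L(h) = -hp + \log(1 - p + pe^{h})$. Then we have

$$e^{L(h)} = \frac{b}{b - a}\, e^{sa} - \frac{a}{b - a}\, e^{sb} = \frac{b - \mathbb{E}[V \mid \mathcal{F}]}{b - a}\, e^{sa} + \frac{\mathbb{E}[V \mid \mathcal{F}] - a}{b - a}\, e^{sb} \qquad (23)$$

since $\mathbb{E}[V \mid \mathcal{F}] = 0$. Since $L(0) = L'(0) = 0$ and $L''(h) \le \frac{1}{4}$, it holds that $L(h) \le \frac{1}{8} h^2 = \frac{1}{8}(b - a)^2 s^2$. Combining this bound
on $L(h)$ with (22) and (23) yields the result. □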
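
The bound of Hoeffding's Lemma can be checked on the extremal two-point case, which is the distribution the convexity step in the proof reduces to. The sketch below assumes $a < 0 < b$ and places mass $b/(b-a)$ at $a$ and mass $-a/(b-a)$ at $b$, so the mean is zero. The function names and the values of `a` and `b` are illustrative assumptions, and the unconditional case is used for simplicity.

```python
import math


def two_point_mgf(s, a, b):
    """MGF of the zero-mean variable taking values a and b (a < 0 < b)."""
    p = -a / (b - a)  # probability of the value b, so that E[V] = 0
    return (1.0 - p) * math.exp(s * a) + p * math.exp(s * b)


def hoeffding_bound(s, a, b):
    """Right-hand side exp((b - a)^2 s^2 / 8) of Hoeffding's Lemma."""
    return math.exp((b - a) ** 2 * s ** 2 / 8.0)


# Hypothetical support endpoints; scan a range of s values.
a, b = -1.0, 3.0
for s in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(s, two_point_mgf(s, a, b), hoeffding_bound(s, a, b))
```

At $s = 0$ both sides equal one, and for every other value in the scan the moment generating function stays below the bound.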

Before proceeding with our analysis, we need to introduce a few useful concentration inequalities for sub-Gaussian
vector-valued random variables. First, for a scalar random variable $\xi$ define the sub-Gaussian norm

$$\tau(\xi) = \inf\left\{ \sigma \ge 0 \;:\; \mathbb{E}\left[ e^{s\xi} \right] \le e^{\frac{1}{2}\sigma^2 s^2} \;\; \forall s \ge 0 \right\}. \qquad (24)$$

Clearly, if $\tau(\xi) < +\infty$, then $\xi$ is sub-Gaussian. Second, for a random vector $v$ in $\mathbb{R}^d$ define

$$B(v) = \sum_{i=1}^{d} \tau^2\left( (v)_i \right) \qquad (25)$$

where $(v)_i$ is the $i$-th component of $v$. We define $v$ to be sub-Gaussian if $B(v) < +\infty$.

Of crucial importance in our analysis is analyzing the norm of an average of vector-valued sub-Gaussian random 
variables. The following lemma describes how to control the sub-Gaussian norm in such a situation. 

Lemma 24. Suppose that $\{v_i\}_{i=1}^{K}$ is a collection of independent sub-Gaussian random variables in $\mathbb{R}^d$. Then it holds that

If in addition the random variables satisfy 


$$\max_{1 \le i \le K} \; \max_{1 \le j \le d} \; \tau^2\left( (v_i)_j \right) \le \tau^2$$


then it holds that 
Proof. We analyze one component of the sum $\sum_{i=1}^{K} v_i$. It holds that


E 




1=1 


= E 






< 


(=1 

K 

E 

(=1 


1 1 




This implies that 


1?L^0 I 


K 


1=1 


1=1 


and so 




Finally, if $\tau^2\left( (v_i)_j \right) \le \tau^2$, then we have




Kti 


< 


< 


d K 


sli 

^ 7=1 V /=1 




iV?' 


i=i 


Tt/ 


□ 


Example 3.2 from [?], a consequence of Theorem 3.1 in [?], is useful for the concentration of the norm of
sub-Gaussian vector random variables.

Lemma 25 (Example 3.2 of [?]). If $v$ is a random vector in $\mathbb{R}^d$ with $B(v) < +\infty$, then

r{||.,||>,}<2exp|-^^} 

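Finally, we will also need to deal with dependent random variables that are sub-Gaussian with respect to a particular
filtration.


Lemma 26. Suppose that a random variable $V$ and a sigma algebra $\mathcal{F}$ satisfy

1. $\mathbb{E}[V \mid \mathcal{F}] = 0$

2. $\mathbb{P}\{|V| > t \mid \mathcal{F}\} \le 2e^{-ct^2}$ with $c$ a constant.

Then it holds that

$$\mathbb{E}\left[ e^{sV} \mid \mathcal{F} \right] \le \exp\left\{ \frac{1}{2} \left( \frac{9}{c} \right) s^2 \right\}$$

for all $s \ge 0$.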
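Proof. Adapted from the characterization of sub-Gaussian random variables in [?]. First, we have for any $a < c$ that

$$\begin{aligned}
\mathbb{E}\left[ e^{aV^2} \mid \mathcal{F} \right]
&\le 1 + \int_0^{\infty} 2at\, e^{at^2}\, \mathbb{P}\{|V| > t \mid \mathcal{F}\}\, dt \\
&\le 1 + \int_0^{\infty} 4at\, e^{-(c - a)t^2}\, dt \\
&= 1 + \frac{2a}{c - a}.
\end{aligned}$$

Setting $a = \frac{c}{3}$ yields the bound

$$\mathbb{E}\left[ e^{aV^2} \mid \mathcal{F} \right] \le 2.$$

Since $\mathbb{E}[V \mid \mathcal{F}] = 0$, by a Taylor expansion we have

$$\begin{aligned}
\mathbb{E}\left[ e^{sV} \mid \mathcal{F} \right]
&= 1 + \int_0^1 (1 - y)\, \mathbb{E}\left[ (sV)^2 e^{ysV} \mid \mathcal{F} \right] dy \\
&\le \left( 1 + \frac{s^2}{a} \right) e^{\frac{s^2}{2a}} \\
&\le \exp\left\{ \frac{3s^2}{2a} \right\} \\
&= \exp\left\{ \frac{1}{2} \left( \frac{9}{c} \right) s^2 \right\}.
\end{aligned}$$

□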